Business Cases for Data Science

Business Case 3 - Instacart Market Basket Analysis Logo

Group AA

Members:

  • Emil Ahmadov (m20201004@novaims.unl.pt)
  • Doris Macean (m20200609@novaims.unl.pt)
  • Doyun Shin (m20200565@novaims.unl.pt)
  • Anastasiia Tagiltseva (m20200041@novaims.unl.pt)

1. Business Understanding

Company wants to see overview of Instacart's business:

  • What are the main types of consumer behavior in the business?
  • Which types of products should have an extended amount of product offerings?
  • Which types of products can be seen as substitutes?
  • Which items are complementary?

Project Plan

Phase Time Resources Risks
Business Understanding 1 day All analysts Economic and market changes
Data Understanding 1 day All analysts Data problems, technological problems
Data Preparation 2 days Data scientists, DB engineers Data problems, technological problems
Modeling 1 day Data scientists Technological problems, inability to build adequate model
Evaluation 1 day All analysts Economic change inability to implement results
Deployment 1 day Data scientists, DB engineers, implementation team Economic change inability to implement results

2. Data Understanding

Datasets description

departments.csv

  • department_id: numeric - ID of the department
  • department: categorical - department name

order_products.csv

  • order_id: numeric - ID of the order
  • product_id: numeric - ID of the product
  • add_to_cart_order: numeric - the sequence in which the product is added to the cart in the order
  • reordered: numeric - indicates that the customer has a previous order that contains the product

orders.csv

  • order_id: numeric - ID of the order
  • user_id: numeric - ID of the customer
  • order_number: numeric - number of the order
  • order_dow: numeric - the day of week
  • order_hour_of_day: numeric - the hour of the day
  • days_since_prior_order: numeric - how many days have passed since prior order

products.csv

  • product_id: numeric - ID of the product
  • department_id: numeric - ID of the department
  • product_name: categorical - product name

2.1 Exploratory Data Analysis

In [1]:
# Data Processing
import numpy as np 
import pandas as pd 

# Apriori & Recommendation
import mlxtend as ml
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
from sklearn.metrics.pairwise import cosine_similarity

#Visualization & Clustering
import seaborn as sns
color = sns.color_palette()
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure, text
import networkx as nx
import plotly.express as px
from sklearn.cluster import KMeans, AgglomerativeClustering
import plotly.graph_objs as go


# Utilities and others
import os
import warnings
warnings.filterwarnings("ignore")
In [2]:
print(nx.__version__)
print(matplotlib.__version__)
print(np.__version__)
2.5.1
3.3.4
1.19.2
In [3]:
# read a csv files from current directory into DataFrame and save it to a variables
orders = pd.read_csv(os.path.join(os.getcwd(),'orders.csv'))
products = pd.read_csv(os.path.join(os.getcwd(),'products.csv'))
order_products = pd.read_csv(os.path.join(os.getcwd(),'order_products.csv'))
departments = pd.read_csv(os.path.join(os.getcwd(),'departments.csv'))
In [4]:
# Check the number of unique orders and unique products
orders_Unique = len(set(order_products.order_id))
products_Unique = len(set(order_products.product_id))
print("There are %s orders for %s products" %(orders_Unique, products_Unique))
#customers
print("Number of unique customers in the whole dataset : ",len(set(orders.user_id)))
There are 200000 orders for 134 products
Number of unique customers in the whole dataset :  105273
In [5]:
#Days of Orders in a week
grouped = orders.groupby("order_id")["order_dow"].aggregate("sum").reset_index()
grouped = grouped.order_dow.value_counts()

sns.set_style('darkgrid')
f, ax = plt.subplots(figsize=(15, 6))
sns.barplot(grouped.index, grouped.values, color = '#ff8200')
plt.ylabel('Number of orders', fontsize=13)
plt.xlabel('Days of order in a week', fontsize=13)
ax.set_facecolor('#fff0db')
plt.show()
In [6]:
#Hours of Order in a Day
grouped = orders.groupby("order_id")["order_hour_of_day"].aggregate("sum").reset_index()
grouped = grouped.order_hour_of_day.value_counts()

sns.set_style('darkgrid')
f, ax = plt.subplots(figsize=(15, 6))
sns.barplot(grouped.index, grouped.values, color = '#43b02a')
plt.ylabel('Number of orders', fontsize=13)
plt.xlabel('Hours of order in a day', fontsize=13)
ax.set_facecolor('#fff0db')
plt.show()
In [7]:
grouped_df = orders.groupby(["order_dow", "order_hour_of_day"])["order_number"].aggregate("count").reset_index()
grouped_df = grouped_df.pivot('order_dow', 'order_hour_of_day', 'order_number')

plt.figure(figsize=(12,6))
sns.heatmap(grouped_df, cmap="Greens")
plt.title("Frequency of Day of week Vs Hour of day")
plt.show()
In [8]:
#Period of Reorders
grouped = orders.groupby("order_id")["days_since_prior_order"].aggregate("sum").reset_index()
grouped = grouped.days_since_prior_order.value_counts()

from matplotlib.ticker import FormatStrFormatter
f, ax = plt.subplots(figsize=(15, 6))
sns.barplot(grouped.index, grouped.values, color = '#ff8200')
ax.xaxis.set_major_formatter(FormatStrFormatter('%.0f'))
plt.ylabel('Number of orders', fontsize=13)
plt.xlabel('Period of reorder', fontsize=13)
ax.set_facecolor('#fff0db')
plt.show()

# The chart indicates Weekly and Monthly orders are the most popular.
In [9]:
#Do people usually reorder the same previous ordered products ?
grouped = order_products.groupby("reordered")["product_id"].agg(Total = 'count').reset_index()
grouped['Ratios'] = round(grouped["Total"].apply(lambda x: x /grouped["Total"].sum())*100,2)
grouped
#59 % of ordered products are previously ordered by customers
Out[9]:
reordered Total Ratios
0 0 828515 41.03
1 1 1190986 58.97

Now that we have seen 59% of the products are re-ordered, there will also be situations when none of the products are re-ordered. Let us check that now.

In [10]:
grouped_df = order_products.groupby("order_id")["reordered"].agg("sum").reset_index()
grouped_df["reordered"].loc[grouped_df["reordered"]>1] = 1
grouped_df.reordered.value_counts() / grouped_df.shape[0]
Out[10]:
1    0.88172
0    0.11828
Name: reordered, dtype: float64

About 12% of the orders has no re-ordered items

In [11]:
grouped = order_products.groupby("order_id")["add_to_cart_order"].agg("max").reset_index()
grouped = grouped.add_to_cart_order.value_counts()

sns.set_style('darkgrid')
f, ax = plt.subplots(figsize=(15, 8))
plt.xticks(rotation='vertical')
sns.barplot(grouped.index, grouped.values, color = '#43b02a')
plt.title("Number of products bought in each order", fontsize=15)
ax.set_facecolor('#fff0db')

plt.ylabel('Number of Orders', fontsize=13)
plt.xlabel('Number of products added in order', fontsize=13)
plt.show()

A right tailed distribution with the maximum value at 5

In [12]:
# Rank the top 10 best-selling items
grouped = order_products.groupby("product_id")["reordered"].agg(frequency_count = 'count').reset_index()
grouped = pd.merge(grouped, products[['product_id', 'product_name']], how='left', on=['product_id'])
percent = grouped.product_name.value_counts(normalize=True).mul(100).round(1).astype(str) + '%'
grouped = grouped.sort_values(by= 'frequency_count', ascending=False)[:10]
grouped[['product_name','frequency_count']].reset_index(drop=True)
Out[12]:
product_name frequency_count
0 fresh fruits 226039
1 fresh vegetables 212611
2 packaged vegetables fruits 109596
3 yogurt 90751
4 packaged cheese 61502
5 milk 55150
6 water seltzer sparkling water 52564
7 chips pretzels 45306
8 soy lactosefree 39389
9 bread 36381
In [13]:
grouped  = grouped.groupby(['product_name']).sum()['frequency_count'].sort_values(ascending=False)

sns.set_style('darkgrid')
f, ax = plt.subplots(figsize=(15, 6))
plt.xticks(rotation='vertical')
sns.barplot(grouped.index, grouped.values, color = '#ff8200')
plt.ylabel('Number of Reorders', fontsize=13)
plt.xlabel('Most ordered Products', fontsize=13)
ax.set_facecolor('#fff0db')
plt.show()
In [14]:
#Which products are usually reordered
grouped = order_products.groupby("product_id")["reordered"].agg(reorder_sum = 'sum', reorder_total = 'count').reset_index()
grouped['reorder_probability'] = grouped['reorder_sum'] / grouped['reorder_total']
grouped = pd.merge(grouped, products[['product_id', 'product_name']], how='left', on=['product_id'])
grouped = grouped[grouped.reorder_total > 75].sort_values(['reorder_probability'], ascending=False)[:10]
grouped
Out[14]:
product_id reorder_sum reorder_total reorder_probability product_name
83 84 43162 55150 0.782629 milk
114 115 38467 52564 0.731813 water seltzer sparkling water
23 24 162355 226039 0.718261 fresh fruits
85 86 19798 27986 0.707425 eggs
90 91 27251 39389 0.691843 soy lactosefree
31 32 12021 17408 0.690545 packaged produce
52 53 13625 19786 0.688618 cream
119 120 62464 90751 0.688301 yogurt
111 112 24540 36381 0.674528 bread
30 31 23854 35893 0.664586 refrigerated
In [15]:
grouped  = grouped.groupby(['product_name']).sum()['reorder_probability'].sort_values(ascending=False)

sns.set_style('darkgrid')
f, ax = plt.subplots(figsize=(15, 6))
plt.xticks(rotation='vertical')
sns.barplot(grouped.index, grouped.values, color = '#43b02a')
plt.ylim([0.65,0.80])
plt.ylabel('Reorder probability', fontsize=13)
plt.xlabel('Most reordered products', fontsize=12)
ax.set_facecolor('#fff0db')
plt.show()
In [16]:
#obtaining user, order and product detailed info for prior set
user_order_products=pd.merge(orders,order_products, on='order_id',how='left')
user_order_products_all_details=pd.merge(user_order_products,products,on='product_id',how='left')
In [17]:
#extracting order_dow_hod_reord_count, the number of reorders for a particular day of the week and particular hour of the day
order_dow_hod_reord_count=user_order_products_all_details.groupby(['order_dow','order_hour_of_day']).agg({'reordered':sum}).reset_index()
order_dow_hod_reord_count.columns=['order_dow','order_hour_of_day','order_dow_hod_reord_count']

#extracting order_dow_hod_reord_prop, the proportion of reorders for a particular day of the week and particular hour of the day
order_dow_hod_reord_prop=pd.DataFrame(order_dow_hod_reord_count)
order_dow_hod_reord_prop['order_dow_hod_reord_cnt']=user_order_products_all_details.groupby(['order_dow','order_hour_of_day']).agg({'reordered':'count'}).reset_index().reordered
order_dow_hod_reord_prop['order_dow_hod_reord_prop']=order_dow_hod_reord_prop['order_dow_hod_reord_count']/order_dow_hod_reord_prop['order_dow_hod_reord_cnt']
order_dow_hod_reord_prop.drop(['order_dow_hod_reord_count','order_dow_hod_reord_cnt'],axis=1,inplace=True)

#Plotting heatmap illustrating reordering probability in order_dow, order_hour_of_day space
plt.figure(figsize=(15,6))
sns.heatmap(order_dow_hod_reord_count.pivot('order_dow','order_hour_of_day','order_dow_hod_reord_prop'),annot=True,cmap="Greens")
plt.title("Reorder ratio of Day of week Vs Hour of day")
plt.show()

From the above plot, we can observe that probability of reordering is more on 6st day of the week 5th hour of day, followed by 1st day of week 5th hour of day

In [18]:
# products bought distribution by department
combdf = pd.merge(order_products,products, on="product_id")
combdf = pd.merge(combdf,departments, on ='department_id')

plt.figure(figsize=(10,10))
temp_series = combdf['department'].value_counts()
labels = (np.array(temp_series.index))
sizes = (np.array((temp_series / temp_series.sum())*100))
plt.pie(sizes, labels=labels, 
        autopct='%1.1f%%', startangle=200)
plt.title("Departments distribution", fontsize=15)
plt.show()
In [19]:
#reorder ratio by department
grouped_df = combdf.groupby(["department"])["reordered"].aggregate("mean").reset_index()
grouped_df

plt.figure(figsize=(12,8))
sns.pointplot(grouped_df['department'].values, grouped_df['reordered'].values, alpha=0.8,color='#ff8200')
plt.ylabel('Reorder ratio', fontsize=12)
plt.xlabel('Department', fontsize=12)
plt.title("Department wise reorder ratio", fontsize=15)
plt.xticks(rotation='vertical')
ax = plt.gca()
ax.set_facecolor('#fff0db')
plt.show()
In [20]:
grouped_df = combdf.groupby(["product_name"])["reordered"].aggregate("mean").reset_index()
grouped_df

plt.figure(figsize=(20,8))
sns.pointplot(grouped_df['product_name'].values, grouped_df['reordered'].values, alpha=0.8, color = '#43b02a')
plt.ylabel('Reorder ratio', fontsize=12)
plt.xlabel('Product', fontsize=12)
plt.title("Product wise reorder ratio", fontsize=15)
plt.xticks(rotation='vertical')
ax = plt.gca()
ax.set_facecolor('#fff0db')
plt.show()

#values with low reorder rate are likely to be one-shot purchases for a lot of customers
In [21]:
grouped_df = user_order_products_all_details.groupby(["order_dow"])["reordered"].aggregate("mean").reset_index()

plt.figure(figsize=(12,8))
sns.barplot(grouped_df['order_dow'].values, grouped_df['reordered'].values, alpha=0.8, color ='#ff8200')
plt.ylabel('Reorder ratio', fontsize=12)
plt.xlabel('Day of week', fontsize=12)
plt.title("Reorder ratio across day of week", fontsize=15)
plt.xticks(rotation='vertical')
plt.ylim(0.5, 0.7)
ax = plt.gca()
ax.set_facecolor('#fff0db')
plt.show()
In [22]:
grouped_df = user_order_products_all_details.groupby(["order_hour_of_day"])["reordered"].aggregate("mean").reset_index()

plt.figure(figsize=(12,8))
sns.barplot(grouped_df['order_hour_of_day'].values, grouped_df['reordered'].values, alpha=0.8, color = '#43b02a')
plt.ylabel('Reorder ratio', fontsize=12)
plt.xlabel('Hour of day', fontsize=12)
plt.title("Reorder ratio across hour of day", fontsize=15)
plt.xticks(rotation='vertical')
plt.ylim(0.5, 0.7)
ax = plt.gca()
ax.set_facecolor('#fff0db')
plt.show()
In [23]:
#plotting pdf of order_number
my_pal = {"#ff8200",'#43b02a'}
sns.FacetGrid(user_order_products_all_details,hue='reordered',height=6, palette=my_pal).map(sns.distplot,'order_number',bins=20)\
.set(xticks=range(0,101,5))\
.add_legend()
ax = plt.gca()
ax.set_facecolor('#fff0db')
plt.show()

#plotting boxplot of order_number
sns.boxplot(x='reordered',y='order_number',data=user_order_products_all_details, palette=my_pal)
ax = plt.gca()
ax.set_facecolor('#fff0db')
plt.show()

From the above plot, we can observe that when the order number is less than 10 the non-reordered class dominates and when order number is greater than 10, the reordered class starts to dominate.

In [24]:
#plotting PDF of first 50 add_to_cart_order for better view
my_pal = {"#ff8200",'#43b02a'}
sns.FacetGrid(user_order_products_all_details[user_order_products_all_details.add_to_cart_order<=50],hue='reordered',height=6,palette=my_pal)\
.map(sns.distplot,'add_to_cart_order',bins=200)\
.add_legend()
ax = plt.gca()
ax.set_facecolor('#fff0db')
plt.show()

#plotting boxplot of add_to_cart_order
sns.boxplot(x='reordered',y='add_to_cart_order',data=user_order_products_all_details,palette=my_pal).set(ylim=(0,80))
ax = plt.gca()
ax.set_facecolor('#fff0db')
plt.show()

From the above plot we can observe that when add_to_cart_order is less than or equal to 5 , the number of reorders dominate while after 5, the number of non-reorders dominate.

In [25]:
# probability of reordering
my_pal = {"#ff8200",'#43b02a'}
temp = user_order_products_all_details.groupby(['add_to_cart_order','reordered'])
temp=temp.size().unstack()
temp =temp.head(30)
temp.plot(kind = "bar", rot = 90, stacked = False, color=my_pal)
ax = plt.gca()
ax.set_facecolor('#fff0db')
plt.show()
In [26]:
#most reordered product 
temp = user_order_products_all_details[user_order_products_all_details.reordered == 1]
temp = temp.product_name.value_counts()
temp = temp.head(30)
plt.figure(figsize=(15,8))
temp.plot(kind = "bar", rot = 90, stacked = False, color='#ff8200')
ax = plt.gca()
ax.set_facecolor('#fff0db')
plt.show()
In [27]:
grouped = orders.groupby('user_id')['order_id'].apply(lambda x: len(x.unique())).reset_index()
grouped = grouped.groupby('order_id').aggregate("count")

sns.set_style("darkgrid")
f, ax = plt.subplots(figsize=(15, 6))
ax.set_facecolor('#fff0db')
sns.barplot(grouped.index, grouped.user_id, color = '#43b02a')
plt.ylabel('Numbers of Customers')
plt.xlabel('Number of Orders per customer')
plt.xticks(rotation='vertical')
plt.show()
In [28]:
items  = pd.merge(left=products, right=departments, how='left')
grouped = items.groupby("department")["product_id"].agg(Total_products = 'count').reset_index()
grouped['Ratio'] = round(grouped["Total_products"].apply(lambda x: x /grouped['Total_products'].sum())*100,2)
grouped.sort_values(by='Total_products', ascending=False, inplace=True)
grouped.reset_index(drop=True)
Out[28]:
department Total_products Ratio
0 personal care 17 12.69
1 pantry 12 8.96
2 frozen 11 8.21
3 snacks 11 8.21
4 dairy eggs 10 7.46
5 household 10 7.46
6 beverages 8 5.97
7 meat seafood 7 5.22
8 alcohol 5 3.73
9 canned goods 5 3.73
10 deli 5 3.73
11 dry goods pasta 5 3.73
12 produce 5 3.73
13 bakery 5 3.73
14 breakfast 4 2.99
15 babies 4 2.99
16 international 4 2.99
17 bulk 2 1.49
18 pets 2 1.49
19 other 1 0.75
20 missing 1 0.75
In [29]:
grouped  = grouped.groupby(['department']).sum()['Total_products'].sort_values(ascending=False)

sns.set_style("darkgrid")
f, ax = plt.subplots(figsize=(15, 6))
ax.set_facecolor('#fff0db')
plt.xticks(rotation='vertical')
sns.barplot(grouped.index, grouped.values,color='#ff8200')
plt.ylabel('Number of products', fontsize=13)
plt.xlabel('Departments', fontsize=13)
plt.show()
In [30]:
users_flow = orders[['user_id', 'order_id']].merge(order_products[['order_id', 'product_id']],
                                          how='inner', left_on='order_id', right_on='order_id')

users_flow = users_flow.merge(items, how='inner', left_on='product_id',
                                         right_on='product_id')
In [31]:
grouped = users_flow.groupby("department")["order_id"].agg( Total_orders = 'count').reset_index()
grouped['Ratio'] = round(grouped["Total_orders"].apply(lambda x: x /grouped['Total_orders'].sum())*100,2)
grouped.sort_values(by='Total_orders', ascending=False, inplace=True)
grouped.reset_index(drop=True)
Out[31]:
department Total_orders Ratio
0 produce 588996 29.17
1 dairy eggs 336915 16.68
2 snacks 180692 8.95
3 beverages 168126 8.33
4 frozen 139536 6.91
5 pantry 116262 5.76
6 bakery 72983 3.61
7 canned goods 66053 3.27
8 deli 65176 3.23
9 dry goods pasta 54054 2.68
10 household 46446 2.30
11 breakfast 44605 2.21
12 meat seafood 44271 2.19
13 personal care 28134 1.39
14 babies 25940 1.28
15 international 16738 0.83
16 alcohol 9439 0.47
17 pets 6013 0.30
18 missing 4749 0.24
19 other 2240 0.11
20 bulk 2133 0.11
In [32]:
grouped  = grouped.groupby(['department']).sum()['Total_orders'].sort_values(ascending=False)

f, ax = plt.subplots(figsize=(15, 6))
ax.set_facecolor('#fff0db')
plt.xticks(rotation='vertical')
sns.barplot(grouped.index, grouped.values,color = '#43b02a')
plt.ylabel('Number of Orders', fontsize=13)
plt.xlabel('Departments', fontsize=13)
plt.show()
In [33]:
filtered = combdf[['order_id','department']]
ohe_filtered = pd.get_dummies(filtered)

corr = ohe_filtered.corr()

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(145, 300, s=60, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
ax = plt.gca()
ax.set_facecolor('#fff0db')
plt.show()
In [34]:
#top-10 most ordering customers
temp = orders.merge(order_products, on='order_id')
pd.DataFrame(temp.groupby('user_id')['product_id'].count()).sort_values('product_id', ascending=False).head(10)
Out[34]:
product_id
user_id
176478 460
129928 405
126305 384
201268 347
115495 283
100330 271
31903 270
15503 258
105213 245
203166 240
In [35]:
add_to_cart_order_reordered_ratio=order_products[order_products['add_to_cart_order']<30]
add_to_cart_order_reordered_ratio = add_to_cart_order_reordered_ratio.groupby('add_to_cart_order')['reordered'].mean().reset_index()
plt.figure(figsize=(25,8))
sns.pointplot(add_to_cart_order_reordered_ratio.add_to_cart_order, add_to_cart_order_reordered_ratio.reordered, color = '#ff8200' )
plt.title('Add To Cart Order vs. Reorder Ratio', fontsize=16)
plt.xlabel('Add To Cart Order', fontsize=16)
plt.xticks(fontsize=15)
plt.xticks(rotation='vertical')
plt.ylabel('Reorder Ratio', fontsize=16)
plt.yticks(fontsize=15);
ax = plt.gca()
ax.set_facecolor('#fff0db')
plt.show()
In [36]:
# top 15 first choices
grouped = user_order_products_all_details[user_order_products_all_details['add_to_cart_order'] == 1].groupby(['product_name'])['order_id'].agg(['count']).sort_values(by='count',ascending=False) 
grouped[:15]
Out[36]:
count
product_name
fresh fruits 27806
fresh vegetables 14858
milk 11834
water seltzer sparkling water 10148
packaged vegetables fruits 8663
yogurt 8284
soy lactosefree 5734
refrigerated 4643
packaged cheese 4163
soft drinks 3893
bread 3875
eggs 3847
chips pretzels 3423
packaged produce 3178
ice cream ice 3134
In [37]:
# top 15 second choices
grouped = user_order_products_all_details[user_order_products_all_details['add_to_cart_order'] == 2].groupby(['product_name'])['order_id'].agg(['count']).sort_values(by='count',ascending=False) 
grouped[:15] 
Out[37]:
count
product_name
fresh fruits 26382
fresh vegetables 16320
packaged vegetables fruits 9360
yogurt 8917
milk 8459
water seltzer sparkling water 7660
soy lactosefree 5049
packaged cheese 4567
refrigerated 4234
bread 4034
eggs 3720
chips pretzels 3553
soft drinks 3177
packaged produce 3047
ice cream ice 2850
In [38]:
# groupby product frequency and average position in cart 
user_order_products_all_details.groupby(['product_name'])['add_to_cart_order'].agg(['count','mean']).sort_values(by='count',ascending=False) 
Out[38]:
count mean
product_name
fresh fruits 226039 7.160176
fresh vegetables 212611 8.840718
packaged vegetables fruits 109596 8.452434
yogurt 90751 7.942381
packaged cheese 61502 9.116256
... ... ...
kitchen supplies 561 8.766488
baby bath body care 515 10.897087
baby accessories 504 9.250000
beauty 387 9.431525
frozen juice 279 10.243728

134 rows × 2 columns

In [39]:
user_order_products_all_details.groupby(['product_name'])['add_to_cart_order'].agg(['count','mean']).sort_values(by='mean',ascending=True).head(15) 
Out[39]:
count mean
product_name
specialty wines champagnes 614 4.630293
spirits 1795 4.694150
packaged produce 17408 5.102999
beers coolers 3002 5.167555
white wines 1893 5.211833
milk 55150 5.565204
red wines 2135 5.680562
water seltzer sparkling water 52564 5.983259
eggs 27986 6.426356
soft drinks 22428 6.482745
soy lactosefree 39389 6.743736
cream 19786 6.821136
energy sports drinks 6638 7.033745
fresh fruits 226039 7.160176
coffee 12823 7.223427
In [40]:
user_order_products_all_details.groupby(['product_name'])['add_to_cart_order'].apply(lambda x: x.mode().iloc[0]).sort_values(ascending=True).head(15) 
Out[40]:
product_name
specialty cheeses                1
refrigerated                     1
refrigerated pudding desserts    1
seafood counter                  1
ice cream ice                    1
hot dogs bacon sausage           1
hot cereal pancake mixes         1
honeys syrups nectars            1
white wines                      1
grains rice dried goods          1
juice nectars                    1
shave needs                      1
soap                             1
frozen pizza                     1
frozen meat seafood              1
Name: add_to_cart_order, dtype: int64
In [41]:
grouped = order_products.groupby("product_id")["reordered"].agg(frequency_count = 'count').reset_index()
grouped = pd.merge(grouped, products[['product_id', 'product_name']], how='left', on=['product_id'])
percent = grouped.product_name.value_counts(normalize=True).mul(100).round(1).astype(str) + '%'
grouped = grouped.sort_values(by= 'frequency_count', ascending=False)
grouped[['product_name','frequency_count']].reset_index(drop=True)
Out[41]:
product_name frequency_count
0 fresh fruits 226039
1 fresh vegetables 212611
2 packaged vegetables fruits 109596
3 yogurt 90751
4 packaged cheese 61502
... ... ...
129 kitchen supplies 561
130 baby bath body care 515
131 baby accessories 504
132 beauty 387
133 frozen juice 279

134 rows × 2 columns

In [42]:
# Import packages
import matplotlib.pyplot as plt
%matplotlib inline
# Define a function to plot word cloud
def plot_cloud(wordcloud):
    # Set figure size
    plt.figure(figsize=(40, 30))
    # Display image
    plt.imshow(wordcloud)
    # No axis details
    plt.axis("off");
 
 
data = dict(zip(grouped['product_name'].tolist(), grouped['frequency_count'].tolist()))
# Import packages
from wordcloud import WordCloud
from PIL import Image
# Import image to np.array
mask = np.array(Image.open('basket.png'))
# Generate wordcloud
wordcloud = WordCloud(width = 2500, height = 1500, random_state=1, background_color='white', colormap='Set2', collocations=False,  mask=mask).generate_from_frequencies(data)
# Plot
plot_cloud(wordcloud)
 

2.2 Data Exploration based on Hours of Day

In [43]:
product_orders_by_hour = pd.DataFrame({'count': temp.groupby(['product_id', 'order_hour_of_day']).size()}).reset_index()
product_orders_by_hour['pct'] = product_orders_by_hour.groupby('product_id')['count'].apply(lambda x: x/x.sum()*100)
mean_hour = pd.DataFrame({'mean_hour': product_orders_by_hour.groupby('product_id').apply(lambda x: sum(x['order_hour_of_day'] * x['count'])/sum(x['count']))}).reset_index()
In [44]:
morning = mean_hour.sort_values('mean_hour')[:15]
morning = morning.merge(products, on='product_id')
morning
Out[44]:
product_id mean_hour department_id product_name
0 125 12.878016 19 trail mix snack mix
1 46 12.880901 19 mint gum
2 111 13.008237 17 plates bowls cups flatware
3 113 13.010753 1 frozen juice
4 94 13.021738 7 tea
5 53 13.073082 16 cream
6 11 13.078037 11 cold flu allergy
7 32 13.084674 4 packaged produce
8 29 13.114774 13 honeys syrups nectars
9 26 13.147703 7 coffee
10 3 13.181291 19 energy granola bars
11 93 13.220811 3 breakfast bakery
12 48 13.227224 14 breakfast bars pastries
13 75 13.232554 17 laundry
14 109 13.233945 11 skin care
In [45]:
afternoon = mean_hour.sort_values('mean_hour', ascending=False)[:15]
afternoon = afternoon.merge(products, on='product_id')
afternoon
Out[45]:
product_id mean_hour department_id product_name
0 28 14.125995 5 red wines
1 132 14.028424 11 beauty
2 37 13.984133 1 ice cream ice
3 103 13.919263 19 ice cream toppings
4 62 13.830428 5 white wines
5 134 13.793160 5 specialty wines champagnes
6 124 13.787187 5 spirits
7 42 13.756684 1 frozen vegan vegetarian
8 80 13.748918 11 deodorants
9 119 13.742017 1 frozen dessert
10 12 13.737607 9 fresh pasta
11 71 13.735865 16 refrigerated pudding desserts
12 22 13.729178 11 hair care
13 79 13.684907 1 frozen pizza
14 20 13.675958 11 oral hygiene
In [46]:
morning_pct = product_orders_by_hour.merge(morning, on='product_id').sort_values(['mean_hour', 'order_hour_of_day'])
afternoon_pct = product_orders_by_hour.merge(afternoon, on='product_id').sort_values(['mean_hour', 'order_hour_of_day'], ascending=False)
In [47]:
# get list of morning and afteroon product names
morning_product_names = list(morning_pct['product_name'].unique())
morning_product_names = '\n'.join(morning_product_names)
afternoon_product_names = list(afternoon_pct['product_name'].unique())
afternoon_product_names = '\n'.join(afternoon_product_names)
In [48]:
# Figure Size
fig, ax = plt.subplots(figsize=(16, 10))

# Plot
morning_pct.groupby('product_id').plot(x='order_hour_of_day', 
                                       y='pct', 
                                       ax=ax, 
                                       legend=False,
                                       alpha=0.2,
                                       aa=True,
                                       color='#ff8200',
                                       linewidth=1.5,)
afternoon_pct.groupby('product_id').plot(x='order_hour_of_day', 
                                         y='pct', 
                                         ax=ax, 
                                         legend=False,
                                         alpha=0.2,
                                         aa=True,
                                         color = '#43b02a',
                                         linewidth=1.5,)

# Aesthetics
# Margins
plt.margins(x=0.5, y=0.05)

# Hide spines
for spine in ax.spines.values():
    spine.set_visible(False)

# Labels
label_font_size = 14
plt.xlabel('Hour of Day Ordered', fontsize=label_font_size)
plt.ylabel('Percent of Orders by Product', fontsize=label_font_size)

# Tick Range
tick_font_size = 12
ax.tick_params(labelsize=tick_font_size)
plt.xticks(range(0, 25, 2))
plt.yticks(range(0, 16, 5))
plt.xlim([-2, 28])

# Vertical line at noon
plt.vlines(x=12, ymin=0, ymax=15, alpha=0.5, color='gray', linestyle='dashed', linewidth=1.0)

# Text
text_font_size = 12
ax.text(0.01, 0.95, morning_product_names,
        verticalalignment='top', horizontalalignment='left',
        transform=ax.transAxes,
        color='#ff8200', fontsize=text_font_size)
ax.text(0.99, 0.95, afternoon_product_names,
        verticalalignment='top', horizontalalignment='right',
        transform=ax.transAxes,
        color='#43b02a', fontsize=text_font_size);

ax = plt.gca()
ax.set_facecolor('#fff0db')
plt.show()

3. Data Preparation

3.1. Handling missing values

In [49]:
total = order_products.isnull().sum().sort_values(ascending=False)
percent = (order_products.isnull().sum()/order_products.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total Missing', 'Percent'])
missing_data
Out[49]:
Total Missing Percent
order_id 0 0.0
product_id 0 0.0
add_to_cart_order 0 0.0
reordered 0 0.0
In [50]:
all_data = pd.merge(user_order_products_all_details, departments, on='department_id')
In [51]:
all_data.isnull().sum().sort_values(ascending=False)
# they are not missing - it's first order
Out[51]:
days_since_prior_order    124342
order_id                       0
user_id                        0
order_number                   0
order_dow                      0
order_hour_of_day              0
product_id                     0
add_to_cart_order              0
reordered                      0
department_id                  0
product_name                   0
department                     0
dtype: int64

4. Clusterization

4.1 Customer clusterization

In [52]:
order_products_cust = order_products.merge(products, on ='product_id', how='left')
order_products_cust = order_products_cust.merge(departments, on ='department_id', how='left')
order_products_cust = order_products_cust.merge(orders, on='order_id', how='left')
order_products_cust.shape
Out[52]:
(2019501, 12)
In [53]:
order_products_cust
Out[53]:
order_id product_id add_to_cart_order reordered department_id product_name department user_id order_number order_dow order_hour_of_day days_since_prior_order
0 10 24 1 1 4 fresh fruits produce 135442 4 6 8 8.0
1 10 83 2 1 4 fresh vegetables produce 135442 4 6 8 8.0
2 10 16 3 0 4 fresh herbs produce 135442 4 6 8 8.0
3 10 24 4 1 4 fresh fruits produce 135442 4 6 8 8.0
4 10 83 5 1 4 fresh vegetables produce 135442 4 6 8 8.0
... ... ... ... ... ... ... ... ... ... ... ... ...
2019496 3420578 66 17 0 6 asian foods international 6586 6 5 21 5.0
2019497 3420578 16 18 1 4 fresh herbs produce 6586 6 5 21 5.0
2019498 3420693 37 1 0 1 ice cream ice frozen 68287 15 6 15 6.0
2019499 3420693 37 2 0 1 ice cream ice frozen 68287 15 6 15 6.0
2019500 3420693 37 3 0 1 ice cream ice frozen 68287 15 6 15 6.0

2019501 rows × 12 columns

In [54]:
cross_df = pd.crosstab(order_products_cust.user_id, order_products_cust.product_name)
In [55]:
df = cross_df.div(cross_df.sum(axis=1), axis=0)
df.head()
Out[55]:
product_name air fresheners candles asian foods baby accessories baby bath body care baby food formula bakery desserts baking ingredients baking supplies decor beauty beers coolers ... spreads tea tofu meat alternatives tortillas flat bread trail mix snack mix trash bags liners vitamins supplements water seltzer sparkling water white wines yogurt
user_id
2 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 ... 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.000000
3 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 ... 0.052632 0.0 0.0 0.0 0.0 0.0 0.0 0.052632 0.0 0.000000
7 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 ... 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.041667
10 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 ... 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.033333
11 0.0 0.0 0.0 0.0 0.0 0.0 0.181818 0.0 0.0 0.0 ... 0.090909 0.0 0.0 0.0 0.0 0.0 0.0 0.181818 0.0 0.000000

5 rows × 134 columns

In [56]:
from sklearn.decomposition import PCA
pca = PCA(n_components=10)
df_pca = pca.fit_transform(df)
df_pca = pd.DataFrame(df_pca)
df_pca.head()
Out[56]:
0 1 2 3 4 5 6 7 8 9
0 0.086824 -0.161612 -0.028145 -0.013102 -0.060559 -0.048852 -0.039011 -0.063166 0.141439 0.049442
1 0.003981 -0.097846 0.003631 -0.058415 0.099333 -0.056552 -0.017432 -0.030107 -0.006154 0.025352
2 -0.082569 -0.005921 -0.055186 0.018263 -0.035469 -0.022078 -0.034007 -0.055961 -0.037800 -0.097640
3 -0.006792 0.048308 -0.038930 -0.019813 0.052439 -0.018721 -0.017918 -0.025452 0.094820 0.048692
4 -0.166932 -0.034989 0.111684 -0.089712 0.038504 0.045620 -0.014655 0.064693 0.005956 -0.003125
In [57]:
Sum_of_squared_distances = []
K = range(1,10)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(df_pca)
    Sum_of_squared_distances.append(km.inertia_)
In [58]:
plt.subplots(figsize = (8, 5))
plt.plot(K, Sum_of_squared_distances, 'bx-', color= '#43b02a')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
ax = plt.gca()
ax.set_facecolor('#fff0db')
plt.show()
In [59]:
clusterer = KMeans(n_clusters=5,random_state=42).fit(df_pca)
centers = clusterer.cluster_centers_
c_preds = clusterer.predict(df_pca)
print(centers)
[[-1.10221441e-01 -1.14754790e-02 -3.44696405e-02 -1.38605156e-02
  -2.40607981e-02 -8.75953775e-03 -2.46964771e-03 -3.05917273e-03
  -1.61816960e-03 -8.22194686e-04]
 [ 3.67453439e-02 -3.95870812e-02 -1.56045501e-02  1.77550169e-02
   3.70642100e-02  5.48031642e-03  1.23661254e-03  1.42182131e-04
   3.35480827e-03  3.39675230e-04]
 [ 1.10939384e-01  1.86900186e-01  2.85075188e-02  2.09604893e-04
  -1.03741343e-02  5.81472564e-03  1.10319386e-03  3.71073955e-03
  -1.70350409e-03 -6.49063005e-04]
 [-2.44199412e-01 -7.49394162e-02  4.86913305e-01  2.19673039e-02
   3.23835072e-02  1.68526327e-02  1.86806963e-02  5.77328585e-03
   1.82038765e-03  7.64804899e-03]
 [ 2.78385333e-01 -2.41178544e-01  3.40051684e-02 -1.55510941e-02
  -2.64884544e-02  3.15192544e-03 -1.23761965e-03  5.72704078e-03
  -3.30883668e-03  2.46647471e-03]]
In [60]:
temp_df = df_pca.iloc[:, 0:2]
temp_df.columns = ["pc1", "pc2"]
temp_df['cluster'] = c_preds
In [61]:
fig, ax = plt.subplots(figsize = (8, 5))
ax = sns.scatterplot(data = temp_df, x = "pc1", y = "pc2", hue = "cluster")
ax.set_xlabel("Principal Component 1")
ax.set_ylabel("Principal Component 2")
ax.set_title("Cluster Visualization")
ax = plt.gca()
ax.set_facecolor('#fff0db')
plt.show()
In [62]:
# top products per cluster
cross_df['cluster'] = c_preds

cluster1 = cross_df[cross_df.cluster == 0]
cluster2 = cross_df[cross_df.cluster == 1]
cluster3 = cross_df[cross_df.cluster == 2]
cluster4 = cross_df[cross_df.cluster == 3]
cluster5 = cross_df[cross_df.cluster == 4]

cluster1.shape
Out[62]:
(41894, 135)
In [63]:
cluster1.drop('cluster',axis=1).mean().sort_values(ascending=False)[0:10]
Out[63]:
product_name
fresh vegetables                 0.586027
fresh fruits                     0.562300
chips pretzels                   0.544183
packaged cheese                  0.536879
yogurt                           0.526973
water seltzer sparkling water    0.466153
packaged vegetables fruits       0.464840
milk                             0.447367
ice cream ice                    0.388671
refrigerated                     0.376569
dtype: float64
In [64]:
cluster2.shape
Out[64]:
(35087, 135)
In [65]:
cluster2.drop('cluster',axis=1).mean().sort_values(ascending=False)[0:10]
Out[65]:
product_name
fresh fruits                     3.805313
fresh vegetables                 2.210733
packaged vegetables fruits       1.773933
yogurt                           1.613076
packaged cheese                  0.811355
milk                             0.763417
water seltzer sparkling water    0.533274
soy lactosefree                  0.526320
chips pretzels                   0.481631
bread                            0.476245
dtype: float64
In [66]:
cluster3.shape
Out[66]:
(19311, 135)
In [67]:
cluster3.drop('cluster',axis=1).mean().sort_values(ascending=False)[0:10]
Out[67]:
product_name
fresh vegetables                 5.543887
fresh fruits                     2.265186
packaged vegetables fruits       1.258298
yogurt                           0.534410
fresh herbs                      0.512299
packaged cheese                  0.490446
milk                             0.396303
soy lactosefree                  0.368288
frozen produce                   0.332091
water seltzer sparkling water    0.318937
dtype: float64
In [68]:
cluster4.shape
Out[68]:
(2507, 135)
In [69]:
cluster4.drop('cluster',axis=1).mean().sort_values(ascending=False)[0:10]
Out[69]:
product_name
water seltzer sparkling water    2.759872
soft drinks                      0.256482
fresh fruits                     0.194256
yogurt                           0.120862
tea                              0.112086
energy granola bars              0.090945
milk                             0.090945
coffee                           0.089350
packaged vegetables fruits       0.087355
chips pretzels                   0.087355
dtype: float64
In [70]:
cluster5.shape
Out[70]:
(6474, 135)
In [71]:
cluster5.drop('cluster',axis=1).mean().sort_values(ascending=False)[0:10]
Out[71]:
product_name
fresh fruits                     3.820667
packaged vegetables fruits       0.519308
fresh vegetables                 0.505715
packaged produce                 0.400680
milk                             0.268922
yogurt                           0.224436
water seltzer sparkling water    0.192462
soy lactosefree                  0.170065
packaged cheese                  0.141798
frozen produce                   0.115385
dtype: float64
In [72]:
c1 = cluster1.drop('cluster',axis=1).mean().sort_values(ascending=False)
c2 = cluster2.drop('cluster',axis=1).mean().sort_values(ascending=False)
c3 = cluster3.drop('cluster',axis=1).mean().sort_values(ascending=False)
c4 = cluster4.drop('cluster',axis=1).mean().sort_values(ascending=False)
c5 = cluster5.drop('cluster',axis=1).mean().sort_values(ascending=False)
In [73]:
from IPython.display import display, HTML
cluster_means = [[c1['fresh fruits'],c1['fresh vegetables'],c1['packaged vegetables fruits'], c1['yogurt'], c1['packaged cheese'], c1['milk'],c1['water seltzer sparkling water'],c1['chips pretzels']],
                 [c2['fresh fruits'],c2['fresh vegetables'],c2['packaged vegetables fruits'], c2['yogurt'], c2['packaged cheese'], c2['milk'],c2['water seltzer sparkling water'],c2['chips pretzels']],
                 [c3['fresh fruits'],c3['fresh vegetables'],c3['packaged vegetables fruits'], c3['yogurt'], c3['packaged cheese'], c3['milk'],c3['water seltzer sparkling water'],c3['chips pretzels']],
                [c4['fresh fruits'],c4['fresh vegetables'],c4['packaged vegetables fruits'], c4['yogurt'], c4['packaged cheese'], c4['milk'],c4['water seltzer sparkling water'],c4['chips pretzels']],
                [c5['fresh fruits'],c5['fresh vegetables'],c5['packaged vegetables fruits'], c5['yogurt'], c5['packaged cheese'], c5['milk'],c5['water seltzer sparkling water'],c5['chips pretzels']]]
cluster_means = pd.DataFrame(cluster_means, columns = ['fresh fruits','fresh vegetables','packaged vegetables fruits','yogurt','packaged cheese','milk','water seltzer sparkling water','chips pretzels'])
HTML(cluster_means.to_html())
Out[73]:
fresh fruits fresh vegetables packaged vegetables fruits yogurt packaged cheese milk water seltzer sparkling water chips pretzels
0 0.562300 0.586027 0.464840 0.526973 0.536879 0.447367 0.466153 0.544183
1 3.805313 2.210733 1.773933 1.613076 0.811355 0.763417 0.533274 0.481631
2 2.265186 5.543887 1.258298 0.534410 0.490446 0.396303 0.318937 0.241624
3 0.194256 0.063821 0.087355 0.120862 0.061029 0.090945 2.759872 0.087355
4 3.820667 0.505715 0.519308 0.224436 0.141798 0.268922 0.192462 0.111832
In [74]:
cluster_perc = cluster_means.iloc[:, :].apply(lambda x: (x / x.sum())*100,axis=1)
HTML(cluster_perc.to_html())
Out[74]:
fresh fruits fresh vegetables packaged vegetables fruits yogurt packaged cheese milk water seltzer sparkling water chips pretzels
0 13.599469 14.173306 11.242351 12.745064 12.984644 10.819767 11.274102 13.161298
1 31.730155 18.433942 14.791736 13.450447 6.765386 6.365661 4.446647 4.016027
2 20.501104 50.175049 11.388252 4.836691 4.438789 3.586744 2.886549 2.186822
3 5.605433 1.841621 2.520718 3.487569 1.761050 2.624309 79.638582 2.520718
4 66.042774 8.741623 8.976584 3.879529 2.451072 4.648493 3.326836 1.933089

4.2 Product clusterization

In [75]:
#calculate avg
temp = pd.merge(left=products,
         right=order_products.product_id.value_counts().to_frame('count'), 
         left_index=True, right_index=True)

temp = pd.merge(left=temp, 
                    right=pd.DataFrame(order_products.groupby('product_id').reordered.sum().to_frame(), dtype='int64'),  
                    left_index=True, right_index=True)

temp['reorder_rate'] = temp['reordered']/temp['count']

temp = pd.merge(left=temp, 
                right=order_products.groupby('product_id').add_to_cart_order.mean().to_frame('add_to_cart_mean'),
                left_index=True, right_index=True)

temp = pd.merge(left=temp, 
                right=pd.merge(left=order_products, 
                               right=orders[['order_dow', 'order_hour_of_day', 'days_since_prior_order']], 
                               left_on='order_id', right_index=True).groupby('product_id').order_dow.mean().to_frame(),
                left_index=True, right_index=True)

temp = pd.merge(left=temp, 
                right=pd.merge(left=order_products, 
                               right=orders[['order_dow', 'order_hour_of_day', 'days_since_prior_order']], 
                               left_on='order_id', right_index=True).groupby('product_id').order_hour_of_day.mean().to_frame(),
                left_index=True, right_index=True)

temp = pd.merge(left=temp, 
                right=pd.merge(left=order_products, 
                               right=orders[['order_dow', 'order_hour_of_day', 'days_since_prior_order']], 
                               left_on='order_id', right_index=True).groupby('product_id').days_since_prior_order.mean().to_frame(),
                left_index=True, right_index=True)
display(temp.head())
temp.shape
product_id department_id product_name count reordered reorder_rate add_to_cart_mean order_dow order_hour_of_day days_since_prior_order
1 104 13 spices seasonings 4384 2568 0.585766 8.249088 2.988281 13.625000 11.408333
2 94 7 tea 5102 2473 0.484712 9.302822 2.716263 13.307958 11.992620
3 38 1 frozen meals 28639 17181 0.599916 9.679633 2.769557 13.516677 10.973497
4 5 13 marinades meat preparation 12651 6194 0.489606 10.159355 2.706989 13.600806 10.862661
5 11 11 cold flu allergy 3836 1066 0.277894 10.288060 2.658768 13.843602 12.298969
Out[75]:
(133, 10)
In [76]:
#Scaling with StandardScaler
temp.drop(['product_name', 'department_id', 'reordered'], axis=1, inplace=True)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
temp_scaled = scaler.fit_transform(temp)
In [77]:
def fancy_dendrogram(*args, **kwargs):
    max_d = kwargs.pop('max_d', None)
    if max_d and 'color_threshold' not in kwargs:
        kwargs['color_threshold'] = max_d
    annotate_above = kwargs.pop('annotate_above', 0)
    plt.figure(figsize=(15,10))
    ddata = dendrogram(*args, **kwargs)

    if not kwargs.get('no_plot', False):
        plt.title('Hierarchical Clustering Dendrogram (truncated)')
        plt.xlabel('sample index or (cluster size)')
        plt.ylabel('distance')
        for i, d, c in zip(ddata['icoord'], ddata['dcoord'], ddata['color_list']):
            x = 0.5 * sum(i[1:3])
            y = d[1]
            if y > annotate_above:
                plt.plot(x, y, 'o', c=c)
                plt.annotate("%.3g" % y, (x, y), xytext=(0, -5),
                             textcoords='offset points',
                             va='top', ha='center')
        if max_d:
            plt.axhline(y=max_d, c='k')
    return ddata
In [78]:
from scipy.cluster.hierarchy import dendrogram, ward

linked_array = ward(temp_scaled)

fancy_dendrogram(
    linked_array,
    truncate_mode='lastp',
    p=30,
    leaf_rotation=90.,
    leaf_font_size=12.,
    show_contracted=True,
    annotate_above=10,
    max_d=12.5
)
ax = plt.gca()
ax.set_facecolor('#fff0db')

plt.show()
In [79]:
from scipy.cluster.hierarchy import fcluster
max_d = 12.5
clusters = fcluster(linked_array, max_d, criterion='distance')
    
labels, counts = np.unique(clusters, return_counts=True)

temp['clusters'] = clusters

print('reorder rates for each cluster\n')
for i in range(1,len(np.unique(clusters))+1):
    print('\nlabel: {}'.format(i))
    print('n: {}'.format(counts[i-1]))
    print('rr: {}'.format(round(temp[temp['clusters'] == i].reorder_rate.mean()*100, 2))) 
reorder rates for each cluster


label: 1
n: 17
rr: 65.71

label: 2
n: 34
rr: 36.08

label: 3
n: 82
rr: 49.92
In [80]:
merged_df = pd.DataFrame()
for i in range(1,4):
    test = pd.DataFrame(temp[temp['clusters'] == i].mean())
    test = test.T.set_index('clusters', drop = True)
    test['size'] = temp[temp['clusters'] == i].shape[0]
    merged_df = pd.concat([merged_df, test])
merged_df.T.round(2).drop('product_id')
Out[80]:
clusters 1.0 2.0 3.0
count 56165.12 5570.82 10666.73
reorder_rate 0.66 0.36 0.50
add_to_cart_mean 6.53 9.10 8.98
order_dow 2.75 2.63 2.80
order_hour_of_day 13.52 13.39 13.53
days_since_prior_order 11.21 11.63 10.73
size 17.00 34.00 82.00

5. Association Rules Analysis

In [81]:
df = pd.merge(order_products, products, how='left', on='product_id').drop(["product_id", "department_id","add_to_cart_order", "reordered"], axis=1)
In [82]:
# Define dataset to machine learning
market_basket = pd.pivot_table(df, index='order_id', columns='product_name',aggfunc=lambda x: 1 if len(x)>0 else 0).fillna(0)

market_basket.head()
Out[82]:
product_name air fresheners candles asian foods baby accessories baby bath body care baby food formula bakery desserts baking ingredients baking supplies decor beauty beers coolers ... spreads tea tofu meat alternatives tortillas flat bread trail mix snack mix trash bags liners vitamins supplements water seltzer sparkling water white wines yogurt
order_id
10 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
11 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
28 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
38 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
56 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 134 columns

5.1 Analysis for Entire Dataset (General Rules)

Support:

refers to the default popularity of an item and can be calculated by finding number of transactions containing a particular item divided by total number of transactions

Confidence:

refers to the likelihood that an item B is also bought if item A is bought. It can be calculated by finding the number of transactions where A and B are bought together, divided by total number of transactions where A is bought

Lift:

refers to the increase in the ratio of sale of B when A is sold. Lift(A –> B) can be calculated by dividing Confidence(A -> B) divided by Support(B)

Leverage:

computes the difference between the observed frequency of A and C appearing together and the frequency that would be expected if A and C were independent

Conviction:

A high conviction value means that the consequent is highly depending on the antecedent

In [83]:
# Apriori method request a min_support: Support is defined as the percentage of time that an itemset appears in the dataset.
# Defined to start seeing data/results with min_support of 5%

itemsets = apriori(market_basket, min_support= 0.05, use_colnames=True)
In [84]:
# Build your association rules using the mxltend association_rules function.
# min_threshold can be thought of as the level of confidence percentage that you want to return
# Defined to use 50% of min_threshold

rulesConfidence = association_rules(itemsets, metric='confidence', min_threshold=0.6)
rulesConfidence.sort_values(by='confidence', ascending=False, inplace=True)
rulesConfidence.head()
Out[84]:
antecedents consequents antecedent support consequent support support confidence lift leverage conviction
43 (fresh fruits, fresh herbs) (fresh vegetables) 0.070135 0.444360 0.061815 0.881372 1.983463 0.030650 4.683872
86 (packaged vegetables fruits, fresh vegetables,... (fresh fruits) 0.087995 0.555995 0.076240 0.866413 1.558311 0.027315 3.323711
80 (packaged vegetables fruits, milk, fresh veget... (fresh fruits) 0.073075 0.555995 0.062535 0.855765 1.539159 0.021906 3.078336
27 (fresh herbs) (fresh vegetables) 0.093005 0.444360 0.078655 0.845707 1.903203 0.037327 3.601205
83 (packaged vegetables fruits, packaged cheese, ... (fresh fruits) 0.081970 0.555995 0.068325 0.833537 1.499180 0.022750 2.667284
In [85]:
# Generate the association rules - by lift
rulesLift = association_rules(itemsets, metric="lift", min_threshold=1.6)
rulesLift.sort_values(by='lift', ascending=False, inplace=True)
rulesLift.head()
Out[85]:
antecedents consequents antecedent support consequent support support confidence lift leverage conviction
15 (fresh herbs) (fresh fruits, fresh vegetables) 0.093005 0.317560 0.061815 0.664642 2.092964 0.032280 2.034958
12 (fresh fruits, fresh vegetables) (fresh herbs) 0.317560 0.093005 0.061815 0.194656 2.092964 0.032280 1.126221
14 (fresh vegetables) (fresh fruits, fresh herbs) 0.444360 0.070135 0.061815 0.139110 1.983463 0.030650 1.080121
13 (fresh fruits, fresh herbs) (fresh vegetables) 0.070135 0.444360 0.061815 0.881372 1.983463 0.030650 4.683872
83 (yogurt, fresh vegetables) (fresh fruits, packaged vegetables fruits) 0.144660 0.269870 0.076240 0.527029 1.952899 0.037201 1.543710
In [86]:
support = rulesConfidence.support.to_numpy()
confidence = rulesConfidence.confidence.to_numpy()

for i in range (len(support)):
    support[i] = support[i]
    confidence[i] = confidence[i]

plt.figure(figsize=(8,6))    
plt.title('Assonciation Rules')
plt.xlabel('support')
plt.ylabel('confidance')
sns.regplot(x=support, y=confidence, fit_reg=False, color = '#ff8200')
ax = plt.gca()
ax.set_facecolor('#fff0db')
plt.show()
In [87]:
# Recommendation of Market Basket
rec_rules = rulesConfidence[ (rulesConfidence['lift'] > 1.7) & (rulesConfidence['confidence'] >= 0.7) ]
In [88]:
# Recommendation of Market Basket Dataset
cols_keep = {'antecedents':'antecedents', 'consequents':'consequents', 'support':'support', 'confidence':'confidence', 'lift':'lift'}
cols_drop = ['antecedent support', 'consequent support', 'leverage', 'conviction']

recommendation_basket = pd.DataFrame(rec_rules).rename(columns= cols_keep).drop(columns=cols_drop).sort_values(by=['lift'], ascending = False)

display(recommendation_basket)
antecedents consequents support confidence lift
43 (fresh fruits, fresh herbs) (fresh vegetables) 0.061815 0.881372 1.983463
27 (fresh herbs) (fresh vegetables) 0.078655 0.845707 1.903203
1 (canned jarred vegetables) (fresh vegetables) 0.055055 0.765823 1.723429
In [89]:
rulesConfidence['lhs items'] = rulesConfidence['antecedents'].apply(lambda x:len(x))
# Replace frozen sets with strings
rulesConfidence['antecedents_'] = rulesConfidence['antecedents'].apply(lambda a: ','.join(list(a)))
rulesConfidence['consequents_'] = rulesConfidence['consequents'].apply(lambda a: ','.join(list(a)))
# Transform the DataFrame of rulesConfidence into a matrix using the lift metric
pivot = rulesConfidence[rulesConfidence['lhs items']>1].pivot(index = 'antecedents_', 
                    columns = 'consequents_', values= 'lift')
# Generate a heatmap with annotations on and the colorbar off
plt.figure(figsize=(10,6))
sns.heatmap(pivot, annot = True,cmap="Greens")
plt.yticks(rotation=0)
plt.xticks(rotation=90)
ax = plt.gca()
ax.set_facecolor('#fff0db')
plt.show()
In [90]:
top_rules = rulesConfidence.sort_values('confidence', ascending = False)[:10]
top_rules
Out[90]:
antecedents consequents antecedent support consequent support support confidence lift leverage conviction lhs items antecedents_ consequents_
43 (fresh fruits, fresh herbs) (fresh vegetables) 0.070135 0.444360 0.061815 0.881372 1.983463 0.030650 4.683872 2 fresh fruits,fresh herbs fresh vegetables
86 (packaged vegetables fruits, fresh vegetables,... (fresh fruits) 0.087995 0.555995 0.076240 0.866413 1.558311 0.027315 3.323711 3 packaged vegetables fruits,fresh vegetables,yo... fresh fruits
80 (packaged vegetables fruits, milk, fresh veget... (fresh fruits) 0.073075 0.555995 0.062535 0.855765 1.539159 0.021906 3.078336 3 packaged vegetables fruits,milk,fresh vegetables fresh fruits
27 (fresh herbs) (fresh vegetables) 0.093005 0.444360 0.078655 0.845707 1.903203 0.037327 3.601205 1 fresh herbs fresh vegetables
83 (packaged vegetables fruits, packaged cheese, ... (fresh fruits) 0.081970 0.555995 0.068325 0.833537 1.499180 0.022750 2.667284 3 packaged vegetables fruits,packaged cheese,fre... fresh fruits
69 (packaged vegetables fruits, yogurt) (fresh fruits) 0.127910 0.555995 0.105790 0.827066 1.487542 0.034673 2.567481 2 packaged vegetables fruits,yogurt fresh fruits
67 (packaged vegetables fruits, soy lactosefree) (fresh fruits) 0.081385 0.555995 0.066960 0.822756 1.479790 0.021710 2.505050 2 packaged vegetables fruits,soy lactosefree fresh fruits
59 (yogurt, fresh vegetables) (fresh fruits) 0.144660 0.555995 0.118420 0.818609 1.472332 0.037990 2.447781 2 yogurt,fresh vegetables fresh fruits
63 (packaged vegetables fruits, milk) (fresh fruits) 0.107425 0.555995 0.087450 0.814056 1.464143 0.027722 2.387847 2 packaged vegetables fruits,milk fresh fruits
61 (frozen produce, packaged vegetables fruits) (fresh fruits) 0.066985 0.555995 0.054415 0.812346 1.461067 0.017172 2.366084 2 frozen produce,packaged vegetables fruits fresh fruits
In [91]:
def draw_graph(rules, rules_to_show):
    import networkx as nx  
    G1 = nx.DiGraph()
    color_map=[]
    N = 50
    colors = np.random.rand(N)    
    strs=['R0', 'R1', 'R2', 'R3', 'R4', 'R5', 'R6', 'R7', 'R8', 'R9', 'R10', 'R11']   
    for i in range (rules_to_show):      
        G1.add_nodes_from(["R"+str(i)])
        for a in rules.iloc[i]['antecedents']:
            G1.add_nodes_from([a])
            G1.add_edge(a, "R"+str(i), color=colors[i] , weight = 2)
        for c in rules.iloc[i]['consequents']:
                G1.add_nodes_from([c])
                G1.add_edge("R"+str(i), c, color=colors[i],  weight=2)
    for node in G1:
        found_a_string = False
        for item in strs: 
            if node==item:
                 found_a_string = True
        if found_a_string:
             color_map.append('#ff8200')
        else:
             color_map.append('#43b02a')       
  
  
    edges = G1.edges()
    colors = [G1[u][v]['color'] for u,v in edges]
    weights = [G1[u][v]['weight'] for u,v in edges]
   
    pos = nx.spring_layout(G1, k=16, scale=1, seed=1)
    nx.draw(G1, pos, edgelist =edges, node_color = color_map, edge_color=colors, width=weights, font_size=16, with_labels=False)            
     
    for p in pos:  # raise text positions
             pos[p][1] += 0.07
    nx.draw_networkx_labels(G1, pos)
    ax = plt.gca()
    ax.set_facecolor('#fff0db')
    plt.show()
In [92]:
draw_graph (rulesConfidence, 10)

5.1.2 General - Substitutes

In [93]:
itemsets
Out[93]:
support itemsets
0 0.076635 (baking ingredients)
1 0.163865 (bread)
2 0.067765 (breakfast bakery)
3 0.074330 (butter)
4 0.069305 (candy chocolate)
... ... ...
151 0.051295 (packaged vegetables fruits, milk, yogurt)
152 0.051915 (packaged vegetables fruits, packaged cheese, ...
153 0.062535 (fresh fruits, packaged vegetables fruits, mil...
154 0.068325 (fresh fruits, packaged vegetables fruits, pac...
155 0.076240 (fresh fruits, packaged vegetables fruits, yog...

156 rows × 2 columns

In [94]:
frequent_itemsets=itemsets.copy()
In [95]:
# Add a column with the length
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))

# Length=2 and Support>=0.2
frequent_itemsets[(frequent_itemsets['length'] == 2) & (frequent_itemsets['support'] >= 0.5)]
Out[95]:
support itemsets length
In [96]:
# Substitute products 
rulesLift2 = association_rules(frequent_itemsets, metric="lift", min_threshold=0.0)
rulesLift2.sort_values(by='lift', ascending=True, inplace=True)
rulesLift2.head(20)
Out[96]:
antecedents consequents antecedent support consequent support support confidence lift leverage conviction
110 (water seltzer sparkling water) (fresh vegetables) 0.193005 0.444360 0.083355 0.431880 0.971915 -0.002409 0.978033
111 (fresh vegetables) (water seltzer sparkling water) 0.444360 0.193005 0.083355 0.187584 0.971915 -0.002409 0.993328
231 (fresh fruits, fresh vegetables) (water seltzer sparkling water) 0.317560 0.193005 0.063235 0.199128 1.031723 0.001944 1.007645
234 (water seltzer sparkling water) (fresh fruits, fresh vegetables) 0.193005 0.317560 0.063235 0.327634 1.031723 0.001944 1.014983
82 (fresh fruits) (water seltzer sparkling water) 0.555995 0.193005 0.111045 0.199723 1.034807 0.003735 1.008395
83 (water seltzer sparkling water) (fresh fruits) 0.193005 0.555995 0.111045 0.575348 1.034807 0.003735 1.045573
132 (packaged vegetables fruits) (water seltzer sparkling water) 0.365415 0.193005 0.073715 0.201730 1.045204 0.003188 1.010929
133 (water seltzer sparkling water) (packaged vegetables fruits) 0.193005 0.365415 0.073715 0.381933 1.045204 0.003188 1.026725
59 (ice cream ice) (fresh fruits) 0.110510 0.555995 0.064485 0.583522 1.049509 0.003042 1.066094
58 (fresh fruits) (ice cream ice) 0.555995 0.110510 0.064485 0.115981 1.049509 0.003042 1.006189
93 (fresh vegetables) (ice cream ice) 0.444360 0.110510 0.051995 0.117011 1.058827 0.002889 1.007362
92 (ice cream ice) (fresh vegetables) 0.110510 0.444360 0.051995 0.470500 1.058827 0.002889 1.049368
289 (water seltzer sparkling water) (fresh fruits, packaged vegetables fruits) 0.193005 0.269870 0.056550 0.292998 1.085699 0.004464 1.032712
284 (fresh fruits, packaged vegetables fruits) (water seltzer sparkling water) 0.269870 0.193005 0.056550 0.209545 1.085699 0.004464 1.020925
21 (fresh vegetables) (chips pretzels) 0.444360 0.169435 0.082245 0.185086 1.092374 0.006955 1.019206
20 (chips pretzels) (fresh vegetables) 0.169435 0.444360 0.082245 0.485407 1.092374 0.006955 1.079767
139 (water seltzer sparkling water) (yogurt) 0.193005 0.263675 0.056185 0.291106 1.104035 0.005294 1.038696
138 (yogurt) (water seltzer sparkling water) 0.263675 0.193005 0.056185 0.213084 1.104035 0.005294 1.025516
105 (fresh vegetables) (refrigerated) 0.444360 0.134040 0.065825 0.148134 1.105151 0.006263 1.016545
104 (refrigerated) (fresh vegetables) 0.134040 0.444360 0.065825 0.491085 1.105151 0.006263 1.091812
In [97]:
# Generate the association rules - by confidence
rulesConfidence = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.50)

### Plot a basic network graph of the top 50 confidence rules
# Create a copy of the rules and transform the frozensets to strings
rulesToPlot = rulesConfidence.copy(deep=True)
rulesToPlot['LHS'] = [','.join(list(x)) for x in rulesToPlot['antecedents']]
rulesToPlot['RHS'] = [','.join(list(x)) for x in rulesToPlot['consequents']]

# Remove duplicate if reversed rules
rulesToPlot['sortedRow'] = [sorted([a,b]) for a,b in zip(rulesToPlot.LHS, rulesToPlot.RHS)]
rulesToPlot['sortedRow'] = rulesToPlot['sortedRow'].astype(str)
rulesToPlot.drop_duplicates(subset=['sortedRow'], inplace=True)

rulesToPlot
Out[97]:
antecedents consequents antecedent support consequent support support confidence lift leverage conviction LHS RHS sortedRow
0 (bread) (fresh fruits) 0.163865 0.555995 0.112410 0.685992 1.233809 0.021302 1.413990 bread fresh fruits ['bread', 'fresh fruits']
1 (bread) (fresh vegetables) 0.163865 0.444360 0.091590 0.558936 1.257844 0.018775 1.259771 bread fresh vegetables ['bread', 'fresh vegetables']
2 (canned jarred vegetables) (fresh vegetables) 0.071890 0.444360 0.055055 0.765823 1.723429 0.023110 2.372733 canned jarred vegetables fresh vegetables ['canned jarred vegetables', 'fresh vegetables']
3 (canned meals beans) (fresh vegetables) 0.069705 0.444360 0.050435 0.723549 1.628295 0.019461 2.009906 canned meals beans fresh vegetables ['canned meals beans', 'fresh vegetables']
4 (cereal) (fresh fruits) 0.092560 0.555995 0.059620 0.644123 1.158505 0.008157 1.247635 cereal fresh fruits ['cereal', 'fresh fruits']
... ... ... ... ... ... ... ... ... ... ... ... ...
123 (fresh fruits, packaged vegetables fruits, yog... (fresh vegetables) 0.105790 0.444360 0.076240 0.720673 1.621822 0.029231 1.989210 fresh fruits,packaged vegetables fruits,yogurt fresh vegetables ['fresh fruits,packaged vegetables fruits,yogu...
124 (fresh fruits, yogurt, fresh vegetables) (packaged vegetables fruits) 0.118420 0.365415 0.076240 0.643810 1.761860 0.032968 1.781592 fresh fruits,yogurt,fresh vegetables packaged vegetables fruits ['fresh fruits,yogurt,fresh vegetables', 'pack...
125 (packaged vegetables fruits, fresh vegetables,... (fresh fruits) 0.087995 0.555995 0.076240 0.866413 1.558311 0.027315 3.323711 packaged vegetables fruits,fresh vegetables,yo... fresh fruits ['fresh fruits', 'packaged vegetables fruits,f...
126 (packaged vegetables fruits, yogurt) (fresh fruits, fresh vegetables) 0.127910 0.317560 0.076240 0.596044 1.876950 0.035621 1.689392 packaged vegetables fruits,yogurt fresh fruits,fresh vegetables ['fresh fruits,fresh vegetables', 'packaged ve...
127 (yogurt, fresh vegetables) (fresh fruits, packaged vegetables fruits) 0.144660 0.269870 0.076240 0.527029 1.952899 0.037201 1.543710 yogurt,fresh vegetables fresh fruits,packaged vegetables fruits ['fresh fruits,packaged vegetables fruits', 'y...

125 rows × 12 columns

In [98]:
# Plot
rulesToPlot=rulesToPlot[:50]
fig = plt.figure(figsize=(20, 20)) 
G = nx.from_pandas_edgelist(rulesToPlot, 'LHS', 'RHS')

my_pos = nx.spring_layout(G, seed = 1) #for reproducibility 
nx.draw(G, pos=my_pos, with_labels=True, edge_color = '#43b02a', node_size=40, node_color="#ff8200", font_color = 'black', font_size=20)
plt.axis('equal')
plt.show()

5.1.3 General - Complements

In [99]:
# High Confidence and high Lift - complementary products, parameter 1
rulesConfidence[(rulesConfidence['confidence'] >= 0.8) & (rulesConfidence['lift'] >= 1.8)]
Out[99]:
antecedents consequents antecedent support consequent support support confidence lift leverage conviction
34 (fresh herbs) (fresh vegetables) 0.093005 0.44436 0.078655 0.845707 1.903203 0.037327 3.601205
64 (fresh fruits, fresh herbs) (fresh vegetables) 0.070135 0.44436 0.061815 0.881372 1.983463 0.030650 4.683872
In [100]:
# Complement

# High Confidence and high Lift - complementary products, parameter 2
rulesConfidence[(rulesConfidence['confidence'] >= 0.7) & (rulesConfidence['lift'] >= 1.7)]
Out[100]:
antecedents consequents antecedent support consequent support support confidence lift leverage conviction
2 (canned jarred vegetables) (fresh vegetables) 0.071890 0.44436 0.055055 0.765823 1.723429 0.023110 2.372733
34 (fresh herbs) (fresh vegetables) 0.093005 0.44436 0.078655 0.845707 1.903203 0.037327 3.601205
64 (fresh fruits, fresh herbs) (fresh vegetables) 0.070135 0.44436 0.061815 0.881372 1.983463 0.030650 4.683872
In [101]:
# Complement

# High Confidence and high Lift - complementary products, parameter 3
rulesConfidence[(rulesConfidence['confidence'] >= 0.6) & (rulesConfidence['lift'] >= 1.6)]
Out[101]:
antecedents consequents antecedent support consequent support support confidence lift leverage conviction
2 (canned jarred vegetables) (fresh vegetables) 0.071890 0.444360 0.055055 0.765823 1.723429 0.023110 2.372733
3 (canned meals beans) (fresh vegetables) 0.069705 0.444360 0.050435 0.723549 1.628295 0.019461 2.009906
34 (fresh herbs) (fresh vegetables) 0.093005 0.444360 0.078655 0.845707 1.903203 0.037327 3.601205
62 (packaged vegetables fruits, eggs) (fresh vegetables) 0.068650 0.444360 0.050675 0.738165 1.661186 0.020170 2.122097
63 (eggs, fresh vegetables) (packaged vegetables fruits) 0.084310 0.365415 0.050675 0.601056 1.644858 0.019867 1.590660
64 (fresh fruits, fresh herbs) (fresh vegetables) 0.070135 0.444360 0.061815 0.881372 1.983463 0.030650 4.683872
66 (fresh herbs) (fresh fruits, fresh vegetables) 0.093005 0.317560 0.061815 0.664642 2.092964 0.032280 2.034958
86 (fresh fruits, frozen produce) (packaged vegetables fruits) 0.089235 0.365415 0.054415 0.609794 1.668772 0.021807 1.626284
104 (packaged vegetables fruits, packaged cheese) (fresh vegetables) 0.114610 0.444360 0.081970 0.715208 1.609524 0.031042 1.951039
105 (packaged cheese, fresh vegetables) (packaged vegetables fruits) 0.135850 0.365415 0.081970 0.603386 1.651235 0.032328 1.600007
108 (soy lactosefree, fresh vegetables) (packaged vegetables fruits) 0.094120 0.365415 0.057695 0.612994 1.677528 0.023302 1.639729
110 (yogurt, fresh vegetables) (packaged vegetables fruits) 0.144660 0.365415 0.087995 0.608288 1.664651 0.035134 1.620031
113 (fresh fruits, packaged vegetables fruits, milk) (fresh vegetables) 0.087450 0.444360 0.062535 0.715094 1.609268 0.023676 1.950260
114 (fresh fruits, milk, fresh vegetables) (packaged vegetables fruits) 0.099590 0.365415 0.062535 0.627924 1.718387 0.026143 1.705527
118 (fresh fruits, packaged vegetables fruits, pac... (fresh vegetables) 0.090715 0.444360 0.068325 0.753183 1.694984 0.028015 2.251223
119 (fresh fruits, packaged cheese, fresh vegetables) (packaged vegetables fruits) 0.104570 0.365415 0.068325 0.653390 1.788077 0.030114 1.830833
123 (fresh fruits, packaged vegetables fruits, yog... (fresh vegetables) 0.105790 0.444360 0.076240 0.720673 1.621822 0.029231 1.989210
124 (fresh fruits, yogurt, fresh vegetables) (packaged vegetables fruits) 0.118420 0.365415 0.076240 0.643810 1.761860 0.032968 1.781592

5.2 First Time Orders

In [102]:
first_order = orders[orders['days_since_prior_order'].isnull()]
first_order_products=pd.merge(first_order,order_products, on='order_id',how='left')
# Rank the top 10 best-selling items
grouped = first_order_products.groupby("product_id")["reordered"].agg(frequency_count = 'count').reset_index()
grouped = pd.merge(grouped, products[['product_id', 'product_name']], how='left', on=['product_id'])
percent = grouped.product_name.value_counts(normalize=True).mul(100).round(1).astype(str) + '%'
grouped = grouped.sort_values(by= 'frequency_count', ascending=False)[:10]
grouped[['product_name','frequency_count']].reset_index(drop=True)
# we can see difference for the 1st and 2nd place only place
Out[102]:
product_name frequency_count
0 fresh vegetables 13085
1 fresh fruits 12815
2 packaged vegetables fruits 6580
3 yogurt 5575
4 packaged cheese 3819
5 milk 2996
6 water seltzer sparkling water 2885
7 chips pretzels 2808
8 soy lactosefree 2291
9 bread 2258
In [103]:
# let's check department
items  = pd.merge(left=products, right=departments, how='left')
users_flow_first = first_order_products.merge(items, how='inner', left_on='product_id', right_on='product_id')
grouped = users_flow_first.groupby("department")["order_id"].agg( Total_orders = 'count').reset_index()
grouped['Ratio'] = round(grouped["Total_orders"].apply(lambda x: x /grouped['Total_orders'].sum())*100,2)
grouped.sort_values(by='Total_orders', ascending=False, inplace=True)
grouped.reset_index(drop=True)[:10]
# difference 6-9 
Out[103]:
department Total_orders Ratio
0 produce 35080 28.21
1 dairy eggs 19949 16.04
2 snacks 10837 8.72
3 beverages 10026 8.06
4 frozen 9487 7.63
5 pantry 7462 6.00
6 bakery 4553 3.66
7 deli 4242 3.41
8 canned goods 4209 3.39
9 dry goods pasta 3579 2.88
In [104]:
first_order = orders.loc[orders['days_since_prior_order'].isnull(),'order_id']
first_order_products=pd.merge(first_order,order_products, on='order_id',how='left')
first_order_df = pd.merge(first_order_products, products, how='left', on='product_id').drop(["product_id", "department_id","add_to_cart_order", "reordered"], axis=1)
# Define dataset to machine learning
market_basket_first = pd.pivot_table(first_order_df, index='order_id', columns='product_name',aggfunc=lambda x: 1 if len(x)>0 else 0).fillna(0)

market_basket_first.head()
Out[104]:
product_name air fresheners candles asian foods baby accessories baby bath body care baby food formula bakery desserts baking ingredients baking supplies decor beauty beers coolers ... spreads tea tofu meat alternatives tortillas flat bread trail mix snack mix trash bags liners vitamins supplements water seltzer sparkling water white wines yogurt
order_id
110 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
548 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
560 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
858 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1225 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 134 columns

In [105]:
# Apriori method request a min_support: Support is defined as the percentage of time that an itemset appears in the dataset.
# Defined to start seeing data/results with min_support of 5%

itemsets = apriori(market_basket_first, min_support= 0.05, use_colnames=True)
In [106]:
# Build your association rules using the mxltend association_rules function.
# Defined to use confidence level of 0.6 of min_threshold

rules = association_rules(itemsets, metric='confidence', min_threshold=0.6)
In [107]:
# Below the list of products sales combinations
# It can use this information to build a cross-sell recommendation system that promotes these products with each other 

rules.sort_values("lift", ascending = False, inplace = True)
rules.head(10)
Out[107]:
antecedents consequents antecedent support consequent support support confidence lift leverage conviction
45 (fresh herbs) (fresh fruits, fresh vegetables) 0.095724 0.312551 0.063163 0.659847 2.111164 0.033244 2.020997
43 (fresh fruits, fresh herbs) (fresh vegetables) 0.069855 0.448915 0.063163 0.904206 2.014204 0.031804 5.752794
24 (fresh herbs) (fresh vegetables) 0.095724 0.448915 0.081688 0.853367 1.900957 0.038716 3.758274
86 (fresh fruits, packaged cheese, fresh vegetables) (packaged vegetables fruits) 0.103232 0.360372 0.068386 0.662451 1.838240 0.031184 1.894916
72 (frozen produce, fresh vegetables) (packaged vegetables fruits) 0.085115 0.360372 0.054268 0.637584 1.769238 0.023595 1.764899
89 (fresh fruits, yogurt, fresh vegetables) (packaged vegetables fruits) 0.117349 0.360372 0.074588 0.635605 1.763746 0.032298 1.755315
85 (fresh fruits, packaged vegetables fruits, pac... (fresh vegetables) 0.088134 0.448915 0.068386 0.775926 1.728449 0.028821 2.459389
71 (frozen produce, packaged vegetables fruits) (fresh vegetables) 0.070181 0.448915 0.054268 0.773256 1.722501 0.022763 2.430428
1 (canned jarred vegetables) (fresh vegetables) 0.074017 0.448915 0.056798 0.767365 1.709378 0.023571 2.368884
83 (fresh fruits, milk, fresh vegetables) (packaged vegetables fruits) 0.089848 0.360372 0.055329 0.615804 1.708800 0.022950 1.664847
In [108]:
top_rules = rules.sort_values('confidence', ascending = False)[:10]
top_rules
Out[108]:
antecedents consequents antecedent support consequent support support confidence lift leverage conviction
43 (fresh fruits, fresh herbs) (fresh vegetables) 0.069855 0.448915 0.063163 0.904206 2.014204 0.031804 5.752794
24 (fresh herbs) (fresh vegetables) 0.095724 0.448915 0.081688 0.853367 1.900957 0.038716 3.758274
90 (packaged vegetables fruits, fresh vegetables,... (fresh fruits) 0.087890 0.531582 0.074588 0.848654 1.596469 0.027867 3.095010
84 (packaged vegetables fruits, milk, fresh veget... (fresh fruits) 0.065938 0.531582 0.055329 0.839109 1.578514 0.020278 2.911401
69 (packaged vegetables fruits, soy lactosefree) (fresh fruits) 0.076954 0.531582 0.063163 0.820785 1.544043 0.022255 2.613720
87 (packaged vegetables fruits, packaged cheese, ... (fresh fruits) 0.083565 0.531582 0.068386 0.818359 1.539480 0.023964 2.578820
63 (frozen produce, packaged vegetables fruits) (fresh fruits) 0.070181 0.531582 0.056553 0.805814 1.515880 0.019246 2.412215
61 (yogurt, fresh vegetables) (fresh fruits) 0.146320 0.531582 0.117349 0.802008 1.508720 0.039569 2.365843
42 (fresh dips tapenades, fresh vegetables) (fresh fruits) 0.063245 0.531582 0.050433 0.797419 1.500089 0.016813 2.312257
57 (soy lactosefree, fresh vegetables) (fresh fruits) 0.093031 0.531582 0.074180 0.797368 1.499993 0.024726 2.311676
In [109]:
frequent_itemsets=itemsets.copy()
In [110]:
##### EXPLORE FREQUENT_ITEMSETS #####

# Add a column with the length
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))

# Length=2 and Support>=0.2
frequent_itemsets[(frequent_itemsets['length'] == 2) & (frequent_itemsets['support'] >= 0.1)]
Out[110]:
support itemsets length
49 0.112045 (fresh fruits, bread) 2
58 0.105517 (fresh fruits, chips pretzels) 2
74 0.312551 (fresh fruits, fresh vegetables) 2
80 0.143300 (fresh fruits, milk) 2
83 0.153991 (fresh fruits, packaged cheese) 2
84 0.261384 (fresh fruits, packaged vegetables fruits) 2
87 0.109515 (fresh fruits, soy lactosefree) 2
90 0.179615 (fresh fruits, yogurt) 2
96 0.115636 (milk, fresh vegetables) 2
98 0.136690 (packaged cheese, fresh vegetables) 2
99 0.237473 (packaged vegetables fruits, fresh vegetables) 2
104 0.146320 (yogurt, fresh vegetables) 2
110 0.114085 (packaged vegetables fruits, packaged cheese) 2
115 0.126979 (packaged vegetables fruits, yogurt) 2
In [111]:
# Generate the association rules - by confidence
rulesConfidence = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.60)

### Plot a basic network graph of the top 50 confidence rules
# Create a copy of the rules and transform the frozensets to strings
rulesToPlot = rulesConfidence.copy(deep=True)
rulesToPlot['LHS'] = [','.join(list(x)) for x in rulesToPlot['antecedents']]
rulesToPlot['RHS'] = [','.join(list(x)) for x in rulesToPlot['consequents']]

# Remove duplicate if reversed rules
rulesToPlot['sortedRow'] = [sorted([a,b]) for a,b in zip(rulesToPlot.LHS, rulesToPlot.RHS)]
rulesToPlot['sortedRow'] = rulesToPlot['sortedRow'].astype(str)
rulesToPlot.drop_duplicates(subset=['sortedRow'], inplace=True)

rulesToPlot
Out[111]:
antecedents consequents antecedent support consequent support support confidence lift leverage conviction LHS RHS sortedRow
0 (bread) (fresh fruits) 0.166395 0.531582 0.112045 0.673369 1.266728 0.023593 1.434092 bread fresh fruits ['bread', 'fresh fruits']
1 (canned jarred vegetables) (fresh vegetables) 0.074017 0.448915 0.056798 0.767365 1.709378 0.023571 2.368884 canned jarred vegetables fresh vegetables ['canned jarred vegetables', 'fresh vegetables']
2 (canned meals beans) (fresh vegetables) 0.070508 0.448915 0.050596 0.717593 1.598506 0.018944 1.951384 canned meals beans fresh vegetables ['canned meals beans', 'fresh vegetables']
3 (cereal) (fresh fruits) 0.102416 0.531582 0.062837 0.613546 1.154190 0.008394 1.212093 cereal fresh fruits ['cereal', 'fresh fruits']
4 (chips pretzels) (fresh fruits) 0.173005 0.531582 0.105517 0.609906 1.147342 0.013550 1.200783 chips pretzels fresh fruits ['chips pretzels', 'fresh fruits']
... ... ... ... ... ... ... ... ... ... ... ... ...
86 (fresh fruits, packaged cheese, fresh vegetables) (packaged vegetables fruits) 0.103232 0.360372 0.068386 0.662451 1.838240 0.031184 1.894916 fresh fruits,packaged cheese,fresh vegetables packaged vegetables fruits ['fresh fruits,packaged cheese,fresh vegetable...
87 (packaged vegetables fruits, packaged cheese, ... (fresh fruits) 0.083565 0.531582 0.068386 0.818359 1.539480 0.023964 2.578820 packaged vegetables fruits,packaged cheese,fre... fresh fruits ['fresh fruits', 'packaged vegetables fruits,p...
88 (fresh fruits, packaged vegetables fruits, yog... (fresh vegetables) 0.101191 0.448915 0.074588 0.737097 1.641953 0.029162 2.096153 fresh fruits,packaged vegetables fruits,yogurt fresh vegetables ['fresh fruits,packaged vegetables fruits,yogu...
89 (fresh fruits, yogurt, fresh vegetables) (packaged vegetables fruits) 0.117349 0.360372 0.074588 0.635605 1.763746 0.032298 1.755315 fresh fruits,yogurt,fresh vegetables packaged vegetables fruits ['fresh fruits,yogurt,fresh vegetables', 'pack...
90 (packaged vegetables fruits, fresh vegetables,... (fresh fruits) 0.087890 0.531582 0.074588 0.848654 1.596469 0.027867 3.095010 packaged vegetables fruits,fresh vegetables,yo... fresh fruits ['fresh fruits', 'packaged vegetables fruits,f...

91 rows × 12 columns

In [112]:
# Plot
rulesToPlot=rulesToPlot[:50]
fig = plt.figure(figsize=(20, 20)) 
G = nx.from_pandas_edgelist(rulesToPlot, 'LHS', 'RHS')

my_pos = nx.spring_layout(G, seed = 1) #for reproducibility 
nx.draw(G, pos=my_pos, with_labels=True, edge_color = '#43b02a', node_size=40, node_color="#ff8200", font_color = 'black', font_size=20)
plt.axis('equal')
plt.show()

5.3 Top 15 add to order

In [113]:
atco = user_order_products_all_details[user_order_products_all_details['add_to_cart_order'] <= 15]
atco = pd.merge(atco,departments, how='left',on='department_id')
grouped = atco.groupby("department")["order_id"].agg( Total_orders = 'count').reset_index()
grouped['Ratio'] = round(grouped["Total_orders"].apply(lambda x: x /grouped['Total_orders'].sum())*100,2)
grouped.sort_values(by='Total_orders', ascending=False, inplace=True)
grouped.reset_index(drop=True)[:10]
Out[113]:
department Total_orders Ratio
0 produce 515312 29.69
1 dairy eggs 298915 17.22
2 beverages 151410 8.72
3 snacks 150196 8.65
4 frozen 116891 6.73
5 pantry 94150 5.42
6 bakery 63619 3.67
7 deli 55556 3.20
8 canned goods 53357 3.07
9 dry goods pasta 42978 2.48
In [114]:
atco = atco[['order_id','product_name']]
In [115]:
# Define dataset to machine learning
market_basket_atco = pd.pivot_table(atco, index='order_id', columns='product_name',aggfunc=lambda x: 1 if len(x)>0 else 0).fillna(0)

market_basket_atco.head()
Out[115]:
product_name air fresheners candles asian foods baby accessories baby bath body care baby food formula bakery desserts baking ingredients baking supplies decor beauty beers coolers ... spreads tea tofu meat alternatives tortillas flat bread trail mix snack mix trash bags liners vitamins supplements water seltzer sparkling water white wines yogurt
order_id
10 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
11 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
28 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
38 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
56 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 134 columns

In [116]:
# Apriori method request a min_support: Support is defined as the percentage of time that an itemset appears in the dataset.
# Defined to start seeing data/results with min_support of 2%

itemsets = apriori(market_basket_first, min_support= 0.05, use_colnames=True)
In [117]:
# Build your association rules using the mxltend association_rules function.
# min_threshold can be thought of as the level of confidence percentage that you want to return
# Defined to use 50% of min_threshold

rules = association_rules(itemsets, metric='confidence', min_threshold=0.6)
In [118]:
# Below the list of products sales combinations
# It can use this information to build a cross-sell recommendation system that promotes these products with each other 

rules.sort_values("lift", ascending = False, inplace = True)
rules.head(10)
Out[118]:
antecedents consequents antecedent support consequent support support confidence lift leverage conviction
45 (fresh herbs) (fresh fruits, fresh vegetables) 0.095724 0.312551 0.063163 0.659847 2.111164 0.033244 2.020997
43 (fresh fruits, fresh herbs) (fresh vegetables) 0.069855 0.448915 0.063163 0.904206 2.014204 0.031804 5.752794
24 (fresh herbs) (fresh vegetables) 0.095724 0.448915 0.081688 0.853367 1.900957 0.038716 3.758274
86 (fresh fruits, packaged cheese, fresh vegetables) (packaged vegetables fruits) 0.103232 0.360372 0.068386 0.662451 1.838240 0.031184 1.894916
72 (frozen produce, fresh vegetables) (packaged vegetables fruits) 0.085115 0.360372 0.054268 0.637584 1.769238 0.023595 1.764899
89 (fresh fruits, yogurt, fresh vegetables) (packaged vegetables fruits) 0.117349 0.360372 0.074588 0.635605 1.763746 0.032298 1.755315
85 (fresh fruits, packaged vegetables fruits, pac... (fresh vegetables) 0.088134 0.448915 0.068386 0.775926 1.728449 0.028821 2.459389
71 (frozen produce, packaged vegetables fruits) (fresh vegetables) 0.070181 0.448915 0.054268 0.773256 1.722501 0.022763 2.430428
1 (canned jarred vegetables) (fresh vegetables) 0.074017 0.448915 0.056798 0.767365 1.709378 0.023571 2.368884
83 (fresh fruits, milk, fresh vegetables) (packaged vegetables fruits) 0.089848 0.360372 0.055329 0.615804 1.708800 0.022950 1.664847
In [119]:
top_rules = rules.sort_values('confidence', ascending = False)[:10]
top_rules
Out[119]:
antecedents consequents antecedent support consequent support support confidence lift leverage conviction
43 (fresh fruits, fresh herbs) (fresh vegetables) 0.069855 0.448915 0.063163 0.904206 2.014204 0.031804 5.752794
24 (fresh herbs) (fresh vegetables) 0.095724 0.448915 0.081688 0.853367 1.900957 0.038716 3.758274
90 (packaged vegetables fruits, fresh vegetables,... (fresh fruits) 0.087890 0.531582 0.074588 0.848654 1.596469 0.027867 3.095010
84 (packaged vegetables fruits, milk, fresh veget... (fresh fruits) 0.065938 0.531582 0.055329 0.839109 1.578514 0.020278 2.911401
69 (packaged vegetables fruits, soy lactosefree) (fresh fruits) 0.076954 0.531582 0.063163 0.820785 1.544043 0.022255 2.613720
87 (packaged vegetables fruits, packaged cheese, ... (fresh fruits) 0.083565 0.531582 0.068386 0.818359 1.539480 0.023964 2.578820
63 (frozen produce, packaged vegetables fruits) (fresh fruits) 0.070181 0.531582 0.056553 0.805814 1.515880 0.019246 2.412215
61 (yogurt, fresh vegetables) (fresh fruits) 0.146320 0.531582 0.117349 0.802008 1.508720 0.039569 2.365843
42 (fresh dips tapenades, fresh vegetables) (fresh fruits) 0.063245 0.531582 0.050433 0.797419 1.500089 0.016813 2.312257
57 (soy lactosefree, fresh vegetables) (fresh fruits) 0.093031 0.531582 0.074180 0.797368 1.499993 0.024726 2.311676
In [120]:
frequent_itemsets=itemsets.copy()
In [121]:
##### EXPLORE FREQUENT_ITEMSETS #####

# Add a column with the length
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))

# Length=2 and Support>=0.2
frequent_itemsets[(frequent_itemsets['length'] == 2) & (frequent_itemsets['support'] >= 0.1)]
Out[121]:
support itemsets length
49 0.112045 (fresh fruits, bread) 2
58 0.105517 (fresh fruits, chips pretzels) 2
74 0.312551 (fresh fruits, fresh vegetables) 2
80 0.143300 (fresh fruits, milk) 2
83 0.153991 (fresh fruits, packaged cheese) 2
84 0.261384 (fresh fruits, packaged vegetables fruits) 2
87 0.109515 (fresh fruits, soy lactosefree) 2
90 0.179615 (fresh fruits, yogurt) 2
96 0.115636 (milk, fresh vegetables) 2
98 0.136690 (packaged cheese, fresh vegetables) 2
99 0.237473 (packaged vegetables fruits, fresh vegetables) 2
104 0.146320 (yogurt, fresh vegetables) 2
110 0.114085 (packaged vegetables fruits, packaged cheese) 2
115 0.126979 (packaged vegetables fruits, yogurt) 2
In [122]:
# Generate the association rules - by confidence
rulesConfidence = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)

### Plot a basic network graph of the top 50 confidence rules
# Create a copy of the rules and transform the frozensets to strings
rulesToPlot = rulesConfidence.copy(deep=True)
rulesToPlot['LHS'] = [','.join(list(x)) for x in rulesToPlot['antecedents']]
rulesToPlot['RHS'] = [','.join(list(x)) for x in rulesToPlot['consequents']]

# Remove duplicate if reversed rules
rulesToPlot['sortedRow'] = [sorted([a,b]) for a,b in zip(rulesToPlot.LHS, rulesToPlot.RHS)]
rulesToPlot['sortedRow'] = rulesToPlot['sortedRow'].astype(str)
rulesToPlot.drop_duplicates(subset=['sortedRow'], inplace=True)

rulesToPlot
Out[122]:
antecedents consequents antecedent support consequent support support confidence lift leverage conviction LHS RHS sortedRow
0 (bread) (fresh fruits) 0.166395 0.531582 0.112045 0.673369 1.266728 0.023593 1.434092 bread fresh fruits ['bread', 'fresh fruits']
1 (canned jarred vegetables) (fresh vegetables) 0.074017 0.448915 0.056798 0.767365 1.709378 0.023571 2.368884 canned jarred vegetables fresh vegetables ['canned jarred vegetables', 'fresh vegetables']
2 (canned meals beans) (fresh vegetables) 0.070508 0.448915 0.050596 0.717593 1.598506 0.018944 1.951384 canned meals beans fresh vegetables ['canned meals beans', 'fresh vegetables']
3 (cereal) (fresh fruits) 0.102416 0.531582 0.062837 0.613546 1.154190 0.008394 1.212093 cereal fresh fruits ['cereal', 'fresh fruits']
4 (chips pretzels) (fresh fruits) 0.173005 0.531582 0.105517 0.609906 1.147342 0.013550 1.200783 chips pretzels fresh fruits ['chips pretzels', 'fresh fruits']
... ... ... ... ... ... ... ... ... ... ... ... ...
86 (fresh fruits, packaged cheese, fresh vegetables) (packaged vegetables fruits) 0.103232 0.360372 0.068386 0.662451 1.838240 0.031184 1.894916 fresh fruits,packaged cheese,fresh vegetables packaged vegetables fruits ['fresh fruits,packaged cheese,fresh vegetable...
87 (packaged vegetables fruits, packaged cheese, ... (fresh fruits) 0.083565 0.531582 0.068386 0.818359 1.539480 0.023964 2.578820 packaged vegetables fruits,packaged cheese,fre... fresh fruits ['fresh fruits', 'packaged vegetables fruits,p...
88 (fresh fruits, packaged vegetables fruits, yog... (fresh vegetables) 0.101191 0.448915 0.074588 0.737097 1.641953 0.029162 2.096153 fresh fruits,packaged vegetables fruits,yogurt fresh vegetables ['fresh fruits,packaged vegetables fruits,yogu...
89 (fresh fruits, yogurt, fresh vegetables) (packaged vegetables fruits) 0.117349 0.360372 0.074588 0.635605 1.763746 0.032298 1.755315 fresh fruits,yogurt,fresh vegetables packaged vegetables fruits ['fresh fruits,yogurt,fresh vegetables', 'pack...
90 (packaged vegetables fruits, fresh vegetables,... (fresh fruits) 0.087890 0.531582 0.074588 0.848654 1.596469 0.027867 3.095010 packaged vegetables fruits,fresh vegetables,yo... fresh fruits ['fresh fruits', 'packaged vegetables fruits,f...

91 rows × 12 columns

In [123]:
# Plot
rulesToPlot=rulesToPlot[:20]
fig = plt.figure(figsize=(20, 20)) 
G = nx.from_pandas_edgelist(rulesToPlot, 'LHS', 'RHS')

my_pos = nx.spring_layout(G, seed = 1) #for reproducibility 
nx.draw(G, pos=my_pos, with_labels=True, edge_color = '#43b02a', node_size=40, node_color="#ff8200", font_color = 'black', font_size=20)
plt.axis('equal')
plt.show()

6. Analysis per Day of Week

limitation:

days since prior order = 0...means the individual ordered twice in the same day. Could we consider this as one basket? The individual forgot to order other items needed? We can't explore this because we only have a sample, don't have the data for the earlier order made that day.

6.1 Feature Engineering, time of the day (tofd)

In [124]:
data2 = pd.merge(user_order_products_all_details, departments, on='department_id')
In [125]:
# We plan to conduct the days of the week analysis then hours of the day. However, instead of hours of the day we 
# tranform it to 4 categorical values--times of the day: 
# Morning (6 to 11, since 11:59 is still stamped as hour 11), Afternoon (12 to 16), evening (17 to 20), night (21 to 5).
# The exact hours of the times of the day is subjective and seasonaly changing. We will stick to the above for simplicity.

print(sum((data2['order_hour_of_day']>=6) & (data2['order_hour_of_day']<12)))
print(sum((data2['order_hour_of_day']>=12) & (data2['order_hour_of_day']<17))) 
print(sum((data2['order_hour_of_day']>=17) & (data2['order_hour_of_day']<21)))

# night:
print(sum(data2['order_hour_of_day']>=21))
print(sum(data2['order_hour_of_day']<6))
print((sum(data2['order_hour_of_day']>=21))+(sum(data2['order_hour_of_day']<6)))
673035
823122
372425
113950
36969
150919
In [126]:
#Morning = 0, Afternoon = 1, Evening = 2, Night = 3
# tofd = time of the day
data2['tofd'] = data2['order_hour_of_day']

data2.loc[(data2['order_hour_of_day']>=6) & (data2['order_hour_of_day']<12), 'tofd'] = 0
data2.loc[(data2['order_hour_of_day']>=12) & (data2['order_hour_of_day']<17), 'tofd'] = 1
data2.loc[(data2['order_hour_of_day']>=17) & (data2['order_hour_of_day']<21), 'tofd'] = 2

# night
data2.loc[data2['order_hour_of_day']>=21, 'tofd'] = 3
data2.loc[data2['order_hour_of_day']<6, 'tofd'] = 3
In [127]:
#Check
print(sum(data2['tofd']==0))
print(sum(data2['tofd']==1))
print(sum(data2['tofd']==2))
print(sum(data2['tofd']==3))
673035
823122
372425
150919
In [128]:
data2['tofd'].value_counts(sort=False).plot(kind='bar', color = '#43b02a')
ax = plt.gca()
ax.set_facecolor('#fff0db')
plt.show()

6.2 Data Split

6.2.1 Split by the days of the week.

In [129]:
# Data split per dow, saved in a list.
dow_range = [0,1,2,3,4,5,6]
df_dow_list=[]

for i in dow_range:
    df_input = data2[(data2['order_dow']==i)]
    df_dow_list.append(df_input)

Day index from the list.

0 = Sun
1 = Mon
2 = Tue
3 = Wed
4 = Thu
5 = Fri
6 = Sat

In [130]:
#example to retrieve Sunday data
df_dow_list[0]
Out[130]:
order_id user_id order_number order_dow order_hour_of_day days_since_prior_order product_id add_to_cart_order reordered department_id product_name department tofd
9 2772240 82371 23 0 8 7.0 51 3 1 13 preserved dips spreads pantry 0
13 325441 47716 5 0 16 30.0 29 11 0 13 honeys syrups nectars pantry 1
24 2749827 104844 27 0 14 7.0 104 1 1 13 spices seasonings pantry 1
25 2749827 104844 27 0 14 7.0 72 15 0 13 condiments pantry 1
28 2101376 171617 9 0 9 1.0 51 9 0 13 preserved dips spreads pantry 0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2019482 3396980 14573 4 0 16 6.0 6 2 0 2 other other 1
2019486 91813 195573 63 0 21 3.0 6 4 1 2 other other 3
2019488 2816975 48477 1 0 17 NaN 6 5 0 2 other other 2
2019493 2684142 62337 11 0 15 9.0 6 3 0 2 other other 1
2019497 1709950 73422 10 0 14 2.0 6 3 0 2 other other 1

391831 rows × 13 columns

6.2.2 Data Split per dow per tofd

In [131]:
# Data split per dow (days of the week) per tofd (time of the day), saved in a list.
dow_range = [0,1,2,3,4,5,6]
tofd_range = [0,1,2,3]
df_tofd_list=[]

for i in dow_range:
    for j in tofd_range:
        df_input = data2[(data2['order_dow']==i) & (data2['tofd']==j)]
        df_tofd_list.append(df_input)
In [132]:
df_tofd_list[0]
Out[132]:
order_id user_id order_number order_dow order_hour_of_day days_since_prior_order product_id add_to_cart_order reordered department_id product_name department tofd
9 2772240 82371 23 0 8 7.0 51 3 1 13 preserved dips spreads pantry 0
28 2101376 171617 9 0 9 1.0 51 9 0 13 preserved dips spreads pantry 0
29 2101376 171617 9 0 9 1.0 17 10 0 13 baking ingredients pantry 0
30 2101376 171617 9 0 9 1.0 17 11 0 13 baking ingredients pantry 0
31 2101376 171617 9 0 9 1.0 17 13 0 13 baking ingredients pantry 0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2019419 334536 173675 11 0 11 30.0 6 23 0 2 other other 0
2019434 3008454 186970 39 0 9 7.0 6 3 0 2 other other 0
2019449 3407232 36753 17 0 9 3.0 6 11 0 2 other other 0
2019455 1527852 182537 8 0 8 1.0 6 9 1 2 other other 0
2019478 1380913 32889 40 0 8 8.0 6 7 1 2 other other 0

123876 rows × 13 columns

The dfs_list can be accessed:

Day range: 0 = Sunday...6 = Monday
Time range: 0 = Morning...3 = Night

Day : Time
0 : 0 = df_tofd_list[0]
0 : 1 = df_tofd_list[1]
0 : 2 = df_tofd_list[2]
0 : 3 = df_tofd_list[3]

1 : 0 = df_tofd_list[4]
1 : 1 = df_tofd_list[5]
1 : 2 = df_tofd_list[6]
1 : 3 = df_tofd_list[7]
...(continues)...
6 : 3 = df_tofd_list[27]

Notice the multiples of 4 is a new day
For easier reference:
0 = Sun
4 = Mon
8 = Tue
12 = Wed
16 = Thu
20 = Fri
24 = Sat

In [133]:
df_output = pd.DataFrame()
for i in range(len(df_tofd_list)):
    df_output = df_output.append(pd.DataFrame([i,len(df_tofd_list[i])]).T)
#     print('Index {}: {}'.format(i,len(df_tofd_list[i])))
df_output.reset_index(inplace=True,drop=True)
In [134]:
df_output
Out[134]:
0 1
0 0 123876
1 1 171971
2 2 67392
3 3 28592
4 4 129591
5 5 137957
6 6 59382
7 7 22306
8 8 87562
9 9 103813
10 10 51077
11 11 19460
12 12 77609
13 13 96083
14 14 46722
15 15 18316
16 16 74338
17 17 91827
18 18 48189
19 19 20530
20 20 89662
21 21 104468
22 22 48585
23 23 19442
24 24 90397
25 25 117003
26 26 51078
27 27 22273
In [135]:
# plotting frequency of orders over day and time (i.e. morning, afternoon, evening, night)
# example: 3 is Sunday night 
plt.plot(df_output[0], df_output[1], color='#ff8200')
plt.title('Distribution',fontsize=14)
plt.xlabel('day-time identifier',fontsize=14)
plt.ylabel('order count',fontsize=14)
plt.rcParams['figure.figsize'] = [10, 5]
plt.rcParams['figure.dpi'] = 100
ax = plt.gca()
ax.set_facecolor('#fff0db')
plt.show()

6.3 Apriori, Confidence, Lift Analysis

In [136]:
# dfl = list of data frames (e.g., df_tofd_list)

# ms = apriori min_support value, 0.05 by default
# cmt = confidence min_threshold, 0.5 by default
# lmt = lift min_threshold, 1.5 by default

# cmptc = complement: min_threshold for confidence
# cmptl = complement: min_threshold for lift
# cmptl_c = complement: ceiling for lift (e.g, cmptl_c=1.5, max lift value is 1.5)
#subsmt = substitutes: min_threshold for lift
def ACL_per_split(dfl, ms=0.05, cmt=0.50, lmt=1.5, cmptc=0.8, cmptl=1.8, cmptl_c=99, subsmt=0.0):
    cmptl_c=cmptl_c
    filtdf_list=[]
    pivot_list=[]
    apriori_list=[]
    rConf_list=[] # List of association dfs by confidence
    rLift_list=[] # List of association dfs by lift
    cmpt_list=[] # List of potential complements dfs
    subs_list=[] # List of potential subs dfs
    for i in range(len(dfl)):
        df_input = dfl[i][['order_id','product_name']]
        filtdf_list.append(df_input)
        # Pivot the data - lines as orders and products as columns
        pt_input = pd.pivot_table(df_input, index='order_id', columns='product_name', 
                            aggfunc=lambda x: 1 if len(x)>0 else 0).fillna(0)
        pivot_list.append(pt_input)
        # Apriori
        freq_sets_input = apriori(pt_input, min_support=ms, use_colnames=True)
        apriori_list.append(freq_sets_input)
        # Association - by confidence
        rConf_input = association_rules(freq_sets_input, metric="confidence", min_threshold=cmt)
        rConf_input.sort_values(by='confidence', ascending=False, inplace=True)
        rConf_list.append(rConf_input)
        # Association - by lift
        rLift_input = association_rules(freq_sets_input, metric="lift", min_threshold=lmt)
        rLift_input.sort_values(by='lift', ascending=False, inplace=True)
        rLift_list.append(rLift_input)
        # High confidence and high lift - complementary products
        if cmptl_c != 99:
            cmpt_input = rConf_input[(rConf_input['confidence']>=cmptc)&(rConf_input['lift']>=cmptl)&(rConf_input['lift']<=cmptl_c)]
        else:
            cmpt_input = rConf_input[(rConf_input['confidence']>=cmptc)&(rConf_input['lift']>=cmptl)]
        cmpt_list.append(cmpt_input)
        # Subtitute products
        subs_input = association_rules(freq_sets_input, metric = 'lift',min_threshold=subsmt)
        subs_input.sort_values(by='lift', ascending=True, inplace=True)
        subs_list.append(subs_input)
    return rConf_list, rLift_list, cmpt_list, subs_list

6.3.1 DOW Result

6.3.1.1 Baseline parameters

In [137]:
pd.options.display.max_rows = 200 # 200 max rows for data inspection
In [138]:
rConf_list_dow, rLift_list_dow, cmpt_list_dow, subs_list_dow = ACL_per_split(df_dow_list)
In [139]:
df_subs_dow = pd.DataFrame()
for i in range(0,len(subs_list_dow)):
    df_insert = subs_list_dow[i].iloc[:5,:][["antecedents","consequents","lift",'confidence','support']]
    df_insert["index"] = i
    df_subs_dow = df_subs_dow.append(df_insert)
df_subs_dow.sort_values(by=['index','support'], ascending=[True,False])
Out[139]:
antecedents consequents lift confidence support index
173 (fresh vegetables) (water seltzer sparkling water) 1.001696 0.191127 0.102058 0
172 (water seltzer sparkling water) (fresh vegetables) 1.001696 0.534887 0.102058 0
147 (fresh vegetables) (ice cream ice) 1.003870 0.119481 0.063801 0
36 (cereal) (fresh vegetables) 1.003836 0.536030 0.053652 0
37 (fresh vegetables) (cereal) 1.003836 0.100475 0.053652 0
109 (fresh vegetables) (water seltzer sparkling water) 0.949087 0.184882 0.080965 1
108 (water seltzer sparkling water) (fresh vegetables) 0.949087 0.415633 0.080965 1
133 (water seltzer sparkling water) (packaged vegetables fruits) 1.011725 0.377821 0.073600 1
239 (fresh fruits, fresh vegetables) (water seltzer sparkling water) 1.007130 0.196189 0.062653 1
242 (water seltzer sparkling water) (fresh fruits, fresh vegetables) 1.007130 0.321626 0.062653 1
91 (fresh vegetables) (water seltzer sparkling water) 0.961079 0.185120 0.076805 2
90 (water seltzer sparkling water) (fresh vegetables) 0.961079 0.398744 0.076805 2
110 (packaged vegetables fruits) (water seltzer sparkling water) 1.029172 0.198236 0.068373 2
185 (fresh fruits, fresh vegetables) (water seltzer sparkling water) 1.027428 0.197900 0.057338 2
188 (water seltzer sparkling water) (fresh fruits, fresh vegetables) 1.027428 0.297678 0.057338 2
65 (water seltzer sparkling water) (fresh fruits) 1.003114 0.521447 0.099922 3
64 (fresh fruits) (water seltzer sparkling water) 1.003114 0.192221 0.099922 3
85 (fresh vegetables) (water seltzer sparkling water) 0.945355 0.181153 0.072341 3
84 (water seltzer sparkling water) (fresh vegetables) 0.945355 0.377516 0.072341 3
171 (fresh fruits, fresh vegetables) (water seltzer sparkling water) 1.019049 0.195275 0.053448 3
66 (fresh fruits) (water seltzer sparkling water) 1.037896 0.198860 0.103388 4
85 (fresh vegetables) (water seltzer sparkling water) 0.963209 0.184550 0.073276 4
84 (water seltzer sparkling water) (fresh vegetables) 0.963209 0.382445 0.073276 4
179 (fresh fruits, fresh vegetables) (water seltzer sparkling water) 1.036892 0.198668 0.054937 4
182 (water seltzer sparkling water) (fresh fruits, fresh vegetables) 1.036892 0.286729 0.054937 4
97 (fresh vegetables) (water seltzer sparkling water) 0.972161 0.193026 0.078171 5
96 (water seltzer sparkling water) (fresh vegetables) 0.972161 0.393704 0.078171 5
117 (water seltzer sparkling water) (packaged vegetables fruits) 1.041015 0.353499 0.070189 5
116 (packaged vegetables fruits) (water seltzer sparkling water) 1.041015 0.206698 0.070189 5
197 (fresh fruits, fresh vegetables) (water seltzer sparkling water) 1.041153 0.206725 0.059495 5
143 (fresh vegetables) (water seltzer sparkling water) 0.997933 0.190669 0.093961 6
142 (water seltzer sparkling water) (fresh vegetables) 0.997933 0.491777 0.093961 6
87 (ice cream ice) (fresh fruits) 1.034761 0.588318 0.080281 6
122 (ice cream ice) (fresh vegetables) 0.990400 0.488065 0.066600 6
123 (fresh vegetables) (ice cream ice) 0.990400 0.135148 0.066600 6

lift values higher than 1 in many cases. This is not a problem per se, since people still buy substitutes together
(e.g. pretzels can be a substitute for mini carrots, but you still buy them together often...even more so in this case because "fresh vegetables" include such a wide range of products)
But still, need to play with different parameters for ACL_per_split(), especially the ms value,
and maybe a restriction on max lift value.

In [140]:
df_cmpt_dow = pd.DataFrame()
for i in range(0,len(cmpt_list_dow)):
    df_insert = cmpt_list_dow[i].iloc[:5,:][["antecedents","consequents","lift",'confidence','support']]
    df_insert["index"] = i
    df_cmpt_dow = df_cmpt_dow.append(df_insert)
df_cmpt_dow.sort_values(by=['index','support'], ascending=[True,False])
Out[140]:
antecedents consequents lift confidence support index
33 (fresh herbs) (fresh vegetables) 1.951179 0.854479 0.074706 1
65 (fresh fruits, fresh herbs) (fresh vegetables) 2.025758 0.887139 0.059043 1
28 (fresh herbs) (fresh vegetables) 2.007477 0.832885 0.068153 2
47 (fresh fruits, fresh herbs) (fresh vegetables) 2.092095 0.867993 0.052792 2
26 (fresh herbs) (fresh vegetables) 2.077131 0.829477 0.064238 3
27 (fresh herbs) (fresh vegetables) 2.082811 0.826986 0.070433 4
45 (fresh fruits, fresh herbs) (fresh vegetables) 2.180681 0.865846 0.053496 4
30 (fresh herbs) (fresh vegetables) 2.023854 0.819616 0.072373 5
51 (fresh fruits, fresh herbs) (fresh vegetables) 2.114490 0.856322 0.056106 5
88 (fresh fruits, fresh herbs) (fresh vegetables) 1.804170 0.889088 0.076180 6
135 (packaged vegetables fruits, fresh herbs) (fresh vegetables) 1.829883 0.901759 0.056982 6

6.3.1.2 Parameter Adjustment for Substitutes

In [141]:
rConf_list_dow2, rLift_list_dow2, cmpt_list_dow2, subs_list_dow2 = ACL_per_split(df_dow_list, ms=0.04, cmptc=0.7, cmptl=1.7)
In [142]:
# Ranked by lift
df_subs_dow2 = pd.DataFrame()
for i in range(0,len(subs_list_dow2)):
    df_insert = subs_list_dow2[i].iloc[:20,:][["antecedents","consequents","lift",'support']]
    df_insert["index"] = i
    df_subs_dow2 = df_subs_dow2.append(df_insert)
df_subs_dow2.iloc[::2] #select even (or odd) number index value to exclude opposite direction duplicates.(Note: This should NOT be done for complementary products as the direction matters a lot more than substitutes)
Out[142]:
antecedents consequents lift support index
267 (milk) (soy lactosefree) 0.872122 0.041279 0
188 (frozen meals) (fresh vegetables) 0.984349 0.045613 0
269 (milk) (water seltzer sparkling water) 0.994827 0.049205 0
229 (fresh vegetables) (water seltzer sparkling water) 1.001696 0.102058 0
48 (cereal) (fresh vegetables) 1.003836 0.053652 0
195 (fresh vegetables) (ice cream ice) 1.003870 0.063801 0
617 (fresh fruits, fresh vegetables) (ice cream ice) 1.019606 0.048036 0
143 (ice cream ice) (fresh fruits) 1.022712 0.074406 0
197 (fresh vegetables) (juice nectars) 1.026134 0.049205 0
217 (fresh vegetables) (refrigerated) 1.034967 0.077199 0
130 (fresh fruits) (soft drinks) 0.815279 0.040555 1
179 (fresh vegetables) (water seltzer sparkling water) 0.949087 0.080965 1
82 (energy granola bars) (fresh vegetables) 0.973196 0.042448 1
63 (fresh vegetables) (cream) 0.993942 0.042390 1
550 (water seltzer sparkling water) (packaged vegetables fruits, fresh vegetables) 0.994119 0.046204 1
412 (water seltzer sparkling water) (fresh fruits, fresh vegetables) 1.007130 0.062653 1
219 (water seltzer sparkling water) (packaged vegetables fruits) 1.011725 0.073600 1
22 (candy chocolate) (fresh fruits) 1.016374 0.042215 1
138 (fresh fruits) (water seltzer sparkling water) 1.035642 0.116572 1
201 (milk) (water seltzer sparkling water) 1.039780 0.050425 1
148 (water seltzer sparkling water) (fresh vegetables) 0.961079 0.076805 2
164 (water seltzer sparkling water) (milk) 0.987431 0.044946 2
434 (water seltzer sparkling water) (packaged vegetables fruits, fresh vegetables) 1.025043 0.042417 2
311 (fresh fruits, fresh vegetables) (water seltzer sparkling water) 1.027428 0.057338 2
181 (water seltzer sparkling water) (packaged vegetables fruits) 1.029172 0.068373 2
117 (water seltzer sparkling water) (fresh fruits) 1.029528 0.104594 2
20 (candy chocolate) (fresh fruits) 1.035930 0.040804 2
58 (cream) (fresh vegetables) 1.057909 0.040437 2
173 (packaged cheese) (water seltzer sparkling water) 1.066624 0.044506 2
92 (fresh fruits) (ice cream ice) 1.067395 0.055541 2
121 (fresh vegetables) (water seltzer sparkling water) 0.945355 0.072341 3
133 (milk) (water seltzer sparkling water) 0.959463 0.041527 3
93 (water seltzer sparkling water) (fresh fruits) 1.003114 0.099922 3
262 (water seltzer sparkling water) (fresh fruits, fresh vegetables) 1.019049 0.053448 3
147 (water seltzer sparkling water) (packaged vegetables fruits) 1.024465 0.064706 3
139 (packaged cheese) (water seltzer sparkling water) 1.032785 0.041254 3
102 (ice cream ice) (fresh vegetables) 1.038007 0.043163 3
323 (water seltzer sparkling water) (fresh fruits, packaged vegetables fruits) 1.050759 0.047604 3
68 (fresh fruits) (ice cream ice) 1.062251 0.057499 3
154 (yogurt) (water seltzer sparkling water) 1.097309 0.050721 3
97 (soft drinks) (fresh fruits) 0.832696 0.040442 4
135 (fresh vegetables) (water seltzer sparkling water) 0.963209 0.073276 4
149 (milk) (water seltzer sparkling water) 0.982703 0.044406 4
286 (water seltzer sparkling water) (fresh fruits, fresh vegetables) 1.036892 0.054937 4
104 (fresh fruits) (water seltzer sparkling water) 1.037896 0.103388 4
78 (fresh fruits) (ice cream ice) 1.052562 0.057300 4
403 (packaged vegetables fruits, fresh vegetables) (water seltzer sparkling water) 1.065906 0.041403 4
164 (packaged vegetables fruits) (water seltzer sparkling water) 1.066246 0.067470 4
0 (fresh fruits) (baking ingredients) 1.098130 0.044807 4
342 (fresh fruits, packaged vegetables fruits) (water seltzer sparkling water) 1.101739 0.050292 4
121 (soft drinks) (fresh fruits) 0.821515 0.044320 5
182 (water seltzer sparkling water) (milk) 0.951843 0.045901 5
163 (fresh vegetables) (water seltzer sparkling water) 0.972161 0.078171 5
488 (water seltzer sparkling water) (packaged vegetables fruits, fresh vegetables) 1.019602 0.041759 5
198 (packaged vegetables fruits) (water seltzer sparkling water) 1.041015 0.070189 5
356 (water seltzer sparkling water) (fresh fruits, fresh vegetables) 1.041153 0.059495 5
23 (fresh fruits) (candy chocolate) 1.042333 0.042249 5
129 (water seltzer sparkling water) (fresh fruits) 1.047411 0.112437 5
190 (water seltzer sparkling water) (packaged cheese) 1.048582 0.046654 5
98 (fresh fruits) (ice cream ice) 1.051474 0.067365 5
147 (soft drinks) (fresh fruits) 0.849303 0.043761 6
164 (frozen meals) (fresh vegetables) 0.975458 0.041999 6
170 (ice cream ice) (fresh vegetables) 0.990400 0.066600 6
200 (water seltzer sparkling water) (fresh vegetables) 0.997933 0.093961 6
235 (milk) (water seltzer sparkling water) 1.011483 0.047670 6
42 (cereal) (fresh vegetables) 1.023846 0.048896 6
173 (fresh vegetables) (juice nectars) 1.028017 0.048513 6
478 (ice cream ice) (fresh fruits, fresh vegetables) 1.028194 0.049356 6
125 (ice cream ice) (fresh fruits) 1.034761 0.080281 6
216 (packaged vegetables fruits) (ice cream ice) 1.038094 0.054874 6
In [143]:
# Lowest 10 Lift values selected, then ranked by Support
df_subs_dow2 = pd.DataFrame()
for i in range(0,len(subs_list_dow2)):
    df_insert = subs_list_dow2[i].iloc[:20,:][["antecedents","consequents","lift",'support']]
    df_insert["index"] = i
    df_subs_dow2 = df_subs_dow2.append(df_insert)
# Select even (or odd) number index value to exclude opposite direction duplicates, then rank by support.
dfs_subs_final = df_subs_dow2.iloc[::2].sort_values(by=['index','support'], ascending=[True,False])
In [144]:
dfs_subs_final.reset_index(drop=True, inplace=True)
# dfs_subs_final.to_excel('dfs_subs_final.xlsx') # save for qualitative discussion
In [145]:
#### 6.3.1.3 Parameter Adjustment and Dominant Element Exclusion for Complements
In [146]:
# Ranked by confidence
for i in range(len(cmpt_list_dow2)):
    cmpt_list_dow2[i].sort_values(by='confidence', ascending=False, inplace=True)



df_cmpt_dow2 = pd.DataFrame()
for i in range(0,len(cmpt_list_dow2)):
    df_insert = cmpt_list_dow2[i].iloc[:30,:][["antecedents","consequents","lift",'confidence','support']]
    df_insert["index"] = i
    df_cmpt_dow2 = df_cmpt_dow2.append(df_insert)
df_cmpt_dow2.sort_values(by=['index','support'], ascending=[True,False])
Out[146]:
antecedents consequents lift confidence support index
188 (fresh herbs) (fresh fruits, fresh vegetables) 1.789954 0.708521 0.080592 0
268 (packaged vegetables fruits, fresh herbs) (fresh vegetables) 1.713176 0.914804 0.061834 0
336 (fresh fruits, packaged vegetables fruits, fre... (fresh vegetables) 1.735559 0.926756 0.053025 0
340 (packaged vegetables fruits, fresh herbs) (fresh fruits, fresh vegetables) 1.981848 0.784479 0.053025 0
419 (packaged vegetables fruits, packaged cheese, ... (fresh fruits, fresh vegetables) 1.770898 0.700978 0.046981 0
49 (fresh herbs) (fresh vegetables) 1.951179 0.854479 0.074706 1
202 (fresh fruits, packaged vegetables fruits, pac... (fresh vegetables) 1.716793 0.751834 0.068621 1
105 (fresh fruits, fresh herbs) (fresh vegetables) 2.025758 0.887139 0.059043 1
9 (canned jarred vegetables) (fresh vegetables) 1.776573 0.778014 0.053366 1
99 (packaged vegetables fruits, eggs) (fresh vegetables) 1.704758 0.746564 0.050600 1
155 (packaged vegetables fruits, fresh herbs) (fresh vegetables) 2.045175 0.895642 0.045476 1
181 (fresh fruits, packaged vegetables fruits, eggs) (fresh vegetables) 1.791242 0.784437 0.043438 1
186 (fresh fruits, packaged vegetables fruits, fro... (fresh vegetables) 1.709911 0.748820 0.041574 1
125 (fresh fruits, soup broth bouillon) (fresh vegetables) 1.754708 0.768438 0.041254 1
78 (fresh fruits, canned jarred vegetables) (fresh vegetables) 1.935240 0.847498 0.040934 1
124 (packaged vegetables fruits, packaged cheese) (fresh vegetables) 1.703766 0.706878 0.073468 2
43 (fresh herbs) (fresh vegetables) 2.007477 0.832885 0.068153 2
157 (fresh fruits, packaged vegetables fruits, yog... (fresh vegetables) 1.717199 0.712451 0.067126 2
145 (fresh fruits, packaged vegetables fruits, pac... (fresh vegetables) 1.792833 0.743831 0.060784 2
79 (fresh fruits, fresh herbs) (fresh vegetables) 2.092095 0.867993 0.052792 2
8 (canned jarred vegetables) (fresh vegetables) 1.795503 0.744939 0.047219 2
10 (canned meals beans) (fresh vegetables) 1.712017 0.710302 0.045753 2
152 (fresh fruits, packaged vegetables fruits, soy... (fresh vegetables) 1.770018 0.734366 0.045203 2
75 (packaged vegetables fruits, eggs) (fresh vegetables) 1.772560 0.735421 0.043920 2
118 (frozen produce, packaged vegetables fruits) (fresh vegetables) 1.707763 0.708537 0.042600 2
135 (fresh fruits, packaged vegetables fruits, bread) (fresh vegetables) 1.758078 0.729412 0.040914 2
36 (fresh herbs) (fresh vegetables) 2.077131 0.829477 0.064238 3
116 (fresh fruits, packaged vegetables fruits, pac... (fresh vegetables) 1.837074 0.733613 0.054499 3
64 (fresh fruits, fresh herbs) (fresh vegetables) 2.182209 0.871438 0.048851 3
4 (canned jarred vegetables) (fresh vegetables) 1.821770 0.727502 0.045033 3
62 (packaged vegetables fruits, eggs) (fresh vegetables) 1.793071 0.716041 0.040865 3
123 (fresh fruits, packaged vegetables fruits, soy... (fresh vegetables) 1.820078 0.726826 0.040319 3
37 (fresh herbs) (fresh vegetables) 2.082811 0.826986 0.070433 4
132 (fresh fruits, packaged vegetables fruits, pac... (fresh vegetables) 1.829928 0.726578 0.057139 4
71 (fresh fruits, fresh herbs) (fresh vegetables) 2.180681 0.865846 0.053496 4
6 (canned jarred vegetables) (fresh vegetables) 1.880072 0.746488 0.044686 4
67 (packaged vegetables fruits, eggs) (fresh vegetables) 1.780122 0.706803 0.041603 4
109 (frozen produce, packaged vegetables fruits) (fresh vegetables) 1.831185 0.727077 0.040642 4
107 (packaged vegetables fruits, fresh herbs) (fresh vegetables) 2.207027 0.876307 0.040282 4
47 (fresh herbs) (fresh vegetables) 2.023854 0.819616 0.072373 5
159 (fresh fruits, packaged vegetables fruits, pac... (fresh vegetables) 1.761003 0.713168 0.058327 5
87 (fresh fruits, fresh herbs) (fresh vegetables) 2.114490 0.856322 0.056106 5
9 (canned jarred vegetables) (fresh vegetables) 1.783668 0.722346 0.048688 5
166 (fresh fruits, packaged vegetables fruits, soy... (fresh vegetables) 1.731600 0.701260 0.044019 5
131 (frozen produce, packaged vegetables fruits) (fresh vegetables) 1.763764 0.714286 0.043303 5
129 (packaged vegetables fruits, fresh herbs) (fresh vegetables) 2.150958 0.871091 0.043002 5
149 (fresh fruits, packaged vegetables fruits, bread) (fresh vegetables) 1.757840 0.711886 0.041496 5
59 (fresh herbs) (fresh vegetables) 1.730002 0.852538 0.095264 6
139 (fresh fruits, fresh herbs) (fresh vegetables) 1.804170 0.889088 0.076180 6
198 (packaged vegetables fruits, fresh herbs) (fresh vegetables) 1.829883 0.901759 0.056982 6
100 (fresh fruits, canned jarred vegetables) (fresh vegetables) 1.750927 0.862850 0.049663 6
240 (fresh fruits, packaged vegetables fruits, fre... (fresh vegetables) 1.861373 0.917277 0.048015 6
244 (packaged vegetables fruits, fresh herbs) (fresh fruits, fresh vegetables) 2.160039 0.759854 0.048015 6
300 (fresh fruits, yogurt, packaged cheese, fresh ... (packaged vegetables fruits) 1.828753 0.708418 0.040313 6
In [147]:
df = df_dow_list
df[0]
Out[147]:
order_id user_id order_number order_dow order_hour_of_day days_since_prior_order product_id add_to_cart_order reordered department_id product_name department tofd
9 2772240 82371 23 0 8 7.0 51 3 1 13 preserved dips spreads pantry 0
13 325441 47716 5 0 16 30.0 29 11 0 13 honeys syrups nectars pantry 1
24 2749827 104844 27 0 14 7.0 104 1 1 13 spices seasonings pantry 1
25 2749827 104844 27 0 14 7.0 72 15 0 13 condiments pantry 1
28 2101376 171617 9 0 9 1.0 51 9 0 13 preserved dips spreads pantry 0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2019482 3396980 14573 4 0 16 6.0 6 2 0 2 other other 1
2019486 91813 195573 63 0 21 3.0 6 4 1 2 other other 3
2019488 2816975 48477 1 0 17 NaN 6 5 0 2 other other 2
2019493 2684142 62337 11 0 15 9.0 6 3 0 2 other other 1
2019497 1709950 73422 10 0 14 2.0 6 3 0 2 other other 1

391831 rows × 13 columns

Complementary by exclusion, 1st element (fresh vegetables)

In [148]:
df = df_dow_list

dfs_filt=[]
for i in range(len(df)):
    novege_input = df[i][~df[i].product_name.str.contains("fresh vegetables")]
    dfs_filt.append(novege_input)
In [149]:
rConf_list_filt, rLift_list_filt, cmpt_list_filt, subs_list_filt = ACL_per_split(dfs_filt, ms=0.03, cmptc=0.8, cmptl=1.2)
In [150]:
# Ranked by confidence
for i in range(len(cmpt_list_filt)):
    cmpt_list_filt[i].sort_values(by='confidence', ascending=False, inplace=True)

df_cmpt_dow2 = pd.DataFrame()
for i in range(0,len(cmpt_list_filt)):
    df_insert = cmpt_list_filt[i].iloc[:10,:][["antecedents","consequents","lift",'confidence','support']]
    df_insert["index"] = i
    df_cmpt_dow2 = df_cmpt_dow2.append(df_insert)
df_cmpt_dow2.sort_values(by=['index','support'], ascending=[True,False])
Out[150]:
antecedents consequents lift confidence support index
268 (packaged vegetables fruits, packaged cheese, ... (fresh fruits) 1.425941 0.874521 0.058807 0
261 (packaged vegetables fruits, milk, yogurt) (fresh fruits) 1.447941 0.888014 0.058063 0
258 (yogurt, milk, packaged cheese) (fresh fruits) 1.421668 0.871901 0.042246 0
241 (packaged vegetables fruits, bread, yogurt) (fresh fruits) 1.435208 0.880205 0.039300 0
271 (packaged vegetables fruits, soy lactosefree, ... (fresh fruits) 1.464624 0.898246 0.036611 0
234 (packaged vegetables fruits, bread, milk) (fresh fruits) 1.416416 0.868680 0.035381 0
245 (packaged vegetables fruits, chips pretzels, y... (fresh fruits) 1.427020 0.875184 0.034094 0
252 (frozen produce, packaged vegetables fruits, y... (fresh fruits) 1.434430 0.879728 0.033265 0
154 (yogurt, fresh herbs) (fresh fruits) 1.422834 0.872616 0.032721 0
264 (packaged vegetables fruits, soy lactosefree, ... (fresh fruits) 1.427895 0.875720 0.030433 0
142 (packaged vegetables fruits, yogurt) (fresh fruits) 1.439056 0.833994 0.110611 1
117 (packaged vegetables fruits, milk) (fresh fruits) 1.439426 0.834208 0.092858 1
136 (packaged vegetables fruits, soy lactosefree) (fresh fruits) 1.438114 0.833448 0.071016 1
158 (packaged vegetables fruits, milk, yogurt) (fresh fruits) 1.518819 0.880220 0.046779 1
160 (packaged vegetables fruits, packaged cheese, ... (fresh fruits) 1.495173 0.866516 0.044735 1
153 (packaged vegetables fruits, milk, packaged ch... (fresh fruits) 1.483130 0.859536 0.038953 1
95 (energy granola bars, packaged vegetables fruits) (fresh fruits) 1.459155 0.845642 0.037114 1
155 (yogurt, milk, packaged cheese) (fresh fruits) 1.449573 0.840088 0.033289 1
123 (packaged vegetables fruits, nuts seeds dried ... (fresh fruits) 1.440496 0.834828 0.032617 1
151 (packaged vegetables fruits, bread, yogurt) (fresh fruits) 1.527233 0.885096 0.031040 1
96 (packaged vegetables fruits, yogurt) (fresh fruits) 1.532900 0.810981 0.094506 2
93 (packaged vegetables fruits, soy lactosefree) (fresh fruits) 1.527978 0.808377 0.061742 2
108 (packaged vegetables fruits, packaged cheese, ... (fresh fruits) 1.619519 0.856808 0.040266 2
63 (packaged vegetables fruits, crackers) (fresh fruits) 1.512143 0.800000 0.038979 2
106 (packaged vegetables fruits, milk, yogurt) (fresh fruits) 1.593260 0.842915 0.038281 2
74 (packaged vegetables fruits, fresh herbs) (fresh fruits) 1.538374 0.813878 0.036662 2
104 (packaged vegetables fruits, milk, packaged ch... (fresh fruits) 1.592992 0.842773 0.031735 2
69 (energy granola bars, packaged vegetables fruits) (fresh fruits) 1.565451 0.828203 0.030669 2
83 (packaged vegetables fruits, yogurt) (fresh fruits) 1.559932 0.813559 0.088173 3
70 (packaged vegetables fruits, lunch meat) (fresh fruits) 1.555057 0.811017 0.037403 3
92 (packaged vegetables fruits, packaged cheese, ... (fresh fruits) 1.618688 0.844203 0.036426 3
90 (packaged vegetables fruits, milk, yogurt) (fresh fruits) 1.641099 0.855891 0.034355 3
65 (packaged vegetables fruits, fresh herbs) (fresh fruits) 1.569783 0.818697 0.033886 3
90 (packaged vegetables fruits, yogurt) (fresh fruits) 1.557145 0.812293 0.088670 4
77 (packaged vegetables fruits, milk) (fresh fruits) 1.535171 0.800830 0.077541 4
87 (packaged vegetables fruits, soy lactosefree) (fresh fruits) 1.545075 0.805996 0.055082 4
52 (packaged vegetables fruits, bread) (fresh fruits) 1.560898 0.814251 0.053716 4
71 (frozen produce, packaged vegetables fruits) (fresh fruits) 1.544840 0.805874 0.045199 4
101 (packaged vegetables fruits, milk, yogurt) (fresh fruits) 1.650206 0.860839 0.038771 4
103 (packaged vegetables fruits, packaged cheese, ... (fresh fruits) 1.659734 0.865809 0.037847 4
69 (packaged vegetables fruits, fresh herbs) (fresh fruits) 1.561299 0.814460 0.037565 4
98 (packaged vegetables fruits, milk, packaged ch... (fresh fruits) 1.603178 0.836306 0.032021 4
109 (packaged vegetables fruits, yogurt) (fresh fruits) 1.522495 0.825809 0.097427 5
93 (packaged vegetables fruits, milk) (fresh fruits) 1.494032 0.810370 0.082656 5
106 (packaged vegetables fruits, soy lactosefree) (fresh fruits) 1.519955 0.824431 0.062975 5
122 (packaged vegetables fruits, milk, yogurt) (fresh fruits) 1.590229 0.862548 0.042197 5
124 (packaged vegetables fruits, packaged cheese, ... (fresh fruits) 1.562236 0.847364 0.040686 5
80 (packaged vegetables fruits, fresh herbs) (fresh fruits) 1.514570 0.821510 0.040686 5
89 (packaged vegetables fruits, lunch meat) (fresh fruits) 1.490124 0.808251 0.039968 5
117 (packaged vegetables fruits, milk, packaged ch... (fresh fruits) 1.562182 0.847335 0.035435 5
119 (yogurt, milk, packaged cheese) (fresh fruits) 1.540731 0.835700 0.031128 5
90 (yogurt, lunch meat) (fresh fruits) 1.503277 0.815385 0.030033 5
194 (packaged vegetables fruits, packaged cheese, ... (fresh fruits) 1.532048 0.873933 0.051173 6
192 (packaged vegetables fruits, milk, yogurt) (fresh fruits) 1.522479 0.868475 0.049250 6
187 (packaged vegetables fruits, milk, packaged ch... (fresh fruits) 1.495862 0.853291 0.041369 6
85 (packaged vegetables fruits, canned meals beans) (fresh fruits) 1.467026 0.836842 0.036678 6
189 (yogurt, milk, packaged cheese) (fresh fruits) 1.485608 0.847442 0.035025 6
182 (packaged vegetables fruits, bread, yogurt) (fresh fruits) 1.513069 0.863107 0.034179 6
120 (yogurt, fresh herbs) (fresh fruits) 1.481643 0.845180 0.033372 6
197 (packaged vegetables fruits, soy lactosefree, ... (fresh fruits) 1.523593 0.869110 0.031911 6
179 (packaged vegetables fruits, bread, packaged c... (fresh fruits) 1.467207 0.836945 0.031180 6
183 (packaged vegetables fruits, eggs, yogurt) (fresh fruits) 1.549712 0.884009 0.030181 6

2nd element (fresh fruits) exclusion

In [151]:
df = df_dow_list

dfs_filt=[]
for i in range(len(df)):
    novege_input = df[i][~df[i].product_name.str.contains("fresh vegetables")]
    novege_input2 = novege_input[~novege_input.product_name.str.contains("fresh fruits")]
    dfs_filt.append(novege_input2)
In [152]:
rConf_list_filt, rLift_list_filt, cmpt_list_filt, subs_list_filt = ACL_per_split(dfs_filt, ms=0.03, cmptc=0.65, cmptl=1.2)
In [153]:
# Ranked by confidence
for i in range(len(cmpt_list_filt)):
    cmpt_list_filt[i].sort_values(by='confidence', ascending=False, inplace=True)

df_cmpt_dow2 = pd.DataFrame()
for i in range(0,len(cmpt_list_filt)):
    df_insert = cmpt_list_filt[i].iloc[:10,:][["antecedents","consequents","lift",'confidence','support']]
    df_insert["index"] = i
    df_cmpt_dow2 = df_cmpt_dow2.append(df_insert)
df_cmpt_dow2.sort_values(by=['index','support'], ascending=[True,False])
Out[153]:
antecedents consequents lift confidence support index
42 (frozen produce, packaged cheese) (packaged vegetables fruits) 1.517826 0.656267 0.033929 0
58 (yogurt, milk, packaged cheese) (packaged vegetables fruits) 1.544153 0.667651 0.032575 0
41 (fresh dips tapenades, yogurt) (packaged vegetables fruits) 1.541878 0.666667 0.030069 0

3rd element (packaged vegetables fruits) exclusion

In [154]:
df = df_dow_list

dfs_filt=[]
for i in range(len(df)):
    novege_input = df[i][~df[i].product_name.str.contains("fresh vegetables")]
    novege_input2 = novege_input[~novege_input.product_name.str.contains("fresh fruits")]
    novege_input3 = novege_input2[~novege_input2.product_name.str.contains("packaged vegetables fruits")]
    dfs_filt.append(novege_input3)
In [155]:
rConf_list_filt, rLift_list_filt, cmpt_list_filt, subs_list_filt = ACL_per_split(dfs_filt, ms=0.02, cmptc=0.5, cmptl=1.1)
# Ranked by confidence
for i in range(len(cmpt_list_filt)):
    cmpt_list_filt[i].sort_values(by='confidence', ascending=False, inplace=True)

df_cmpt_dow2 = pd.DataFrame()
for i in range(0,len(cmpt_list_filt)):
    df_insert = cmpt_list_filt[i].iloc[:10,:][["antecedents","consequents","lift",'confidence','support']]
    df_insert["index"] = i
    df_cmpt_dow2 = df_cmpt_dow2.append(df_insert)
df_cmpt_dow2.sort_values(by=['index','support'], ascending=[True,False])
Out[155]:
antecedents consequents lift confidence support index
12 (milk, packaged cheese) (yogurt) 1.702227 0.514581 0.049103 0
2 (bread, milk) (yogurt) 1.673986 0.506044 0.036407 0
11 (yogurt, lunch meat) (packaged cheese) 1.916885 0.518519 0.026378 0
8 (frozen produce, packaged cheese) (yogurt) 1.669658 0.504735 0.026262 0
5 (yogurt, crackers) (packaged cheese) 1.879861 0.508503 0.026001 0
7 (frozen produce, milk) (yogurt) 1.739756 0.525926 0.024696 0
10 (milk, lunch meat) (yogurt) 1.712066 0.517555 0.023073 0
9 (milk, lunch meat) (packaged cheese) 1.898902 0.513654 0.022899 0
4 (milk, crackers) (yogurt) 1.733295 0.523973 0.022175 0
1 (bread, lunch meat) (packaged cheese) 1.956100 0.529126 0.022117 0
0 (yogurt, lunch meat) (packaged cheese) 2.179116 0.504640 0.021047 1
0 (baby food formula) (yogurt) 1.824236 0.517688 0.023373 6
1 (frozen produce, milk) (yogurt) 1.877361 0.532764 0.021853 6

Consideration:

  • Complementary products are dominated by vegetables, fresh fruits, fresh herbs regardless of the parameters. Their complementarity makes sense, but they are also highly popular products.
  • file:///C:/Users/Lenovo/Desktop/NOVA/S2/Business%20Cases/BC3/jtaer-16-00039-v2.pdf maybe
In [156]:
### Substitutes mapping
In [157]:
dfs_subs_final
Out[157]:
antecedents consequents lift support index
0 (fresh vegetables) (water seltzer sparkling water) 1.001696 0.102058 0
1 (fresh vegetables) (refrigerated) 1.034967 0.077199 0
2 (ice cream ice) (fresh fruits) 1.022712 0.074406 0
3 (fresh vegetables) (ice cream ice) 1.003870 0.063801 0
4 (cereal) (fresh vegetables) 1.003836 0.053652 0
5 (milk) (water seltzer sparkling water) 0.994827 0.049205 0
6 (fresh vegetables) (juice nectars) 1.026134 0.049205 0
7 (fresh fruits, fresh vegetables) (ice cream ice) 1.019606 0.048036 0
8 (frozen meals) (fresh vegetables) 0.984349 0.045613 0
9 (milk) (soy lactosefree) 0.872122 0.041279 0
10 (fresh fruits) (water seltzer sparkling water) 1.035642 0.116572 1
11 (fresh vegetables) (water seltzer sparkling water) 0.949087 0.080965 1
12 (water seltzer sparkling water) (packaged vegetables fruits) 1.011725 0.073600 1
13 (water seltzer sparkling water) (fresh fruits, fresh vegetables) 1.007130 0.062653 1
14 (milk) (water seltzer sparkling water) 1.039780 0.050425 1
15 (water seltzer sparkling water) (packaged vegetables fruits, fresh vegetables) 0.994119 0.046204 1
16 (energy granola bars) (fresh vegetables) 0.973196 0.042448 1
17 (fresh vegetables) (cream) 0.993942 0.042390 1
18 (candy chocolate) (fresh fruits) 1.016374 0.042215 1
19 (fresh fruits) (soft drinks) 0.815279 0.040555 1
20 (water seltzer sparkling water) (fresh fruits) 1.029528 0.104594 2
21 (water seltzer sparkling water) (fresh vegetables) 0.961079 0.076805 2
22 (water seltzer sparkling water) (packaged vegetables fruits) 1.029172 0.068373 2
23 (fresh fruits, fresh vegetables) (water seltzer sparkling water) 1.027428 0.057338 2
24 (fresh fruits) (ice cream ice) 1.067395 0.055541 2
25 (water seltzer sparkling water) (milk) 0.987431 0.044946 2
26 (packaged cheese) (water seltzer sparkling water) 1.066624 0.044506 2
27 (water seltzer sparkling water) (packaged vegetables fruits, fresh vegetables) 1.025043 0.042417 2
28 (candy chocolate) (fresh fruits) 1.035930 0.040804 2
29 (cream) (fresh vegetables) 1.057909 0.040437 2
30 (water seltzer sparkling water) (fresh fruits) 1.003114 0.099922 3
31 (fresh vegetables) (water seltzer sparkling water) 0.945355 0.072341 3
32 (water seltzer sparkling water) (packaged vegetables fruits) 1.024465 0.064706 3
33 (fresh fruits) (ice cream ice) 1.062251 0.057499 3
34 (water seltzer sparkling water) (fresh fruits, fresh vegetables) 1.019049 0.053448 3
35 (yogurt) (water seltzer sparkling water) 1.097309 0.050721 3
36 (water seltzer sparkling water) (fresh fruits, packaged vegetables fruits) 1.050759 0.047604 3
37 (ice cream ice) (fresh vegetables) 1.038007 0.043163 3
38 (milk) (water seltzer sparkling water) 0.959463 0.041527 3
39 (packaged cheese) (water seltzer sparkling water) 1.032785 0.041254 3
40 (fresh fruits) (water seltzer sparkling water) 1.037896 0.103388 4
41 (fresh vegetables) (water seltzer sparkling water) 0.963209 0.073276 4
42 (packaged vegetables fruits) (water seltzer sparkling water) 1.066246 0.067470 4
43 (fresh fruits) (ice cream ice) 1.052562 0.057300 4
44 (water seltzer sparkling water) (fresh fruits, fresh vegetables) 1.036892 0.054937 4
45 (fresh fruits, packaged vegetables fruits) (water seltzer sparkling water) 1.101739 0.050292 4
46 (fresh fruits) (baking ingredients) 1.098130 0.044807 4
47 (milk) (water seltzer sparkling water) 0.982703 0.044406 4
48 (packaged vegetables fruits, fresh vegetables) (water seltzer sparkling water) 1.065906 0.041403 4
49 (soft drinks) (fresh fruits) 0.832696 0.040442 4
50 (water seltzer sparkling water) (fresh fruits) 1.047411 0.112437 5
51 (fresh vegetables) (water seltzer sparkling water) 0.972161 0.078171 5
52 (packaged vegetables fruits) (water seltzer sparkling water) 1.041015 0.070189 5
53 (fresh fruits) (ice cream ice) 1.051474 0.067365 5
54 (water seltzer sparkling water) (fresh fruits, fresh vegetables) 1.041153 0.059495 5
55 (water seltzer sparkling water) (packaged cheese) 1.048582 0.046654 5
56 (water seltzer sparkling water) (milk) 0.951843 0.045901 5
57 (soft drinks) (fresh fruits) 0.821515 0.044320 5
58 (fresh fruits) (candy chocolate) 1.042333 0.042249 5
59 (water seltzer sparkling water) (packaged vegetables fruits, fresh vegetables) 1.019602 0.041759 5
60 (water seltzer sparkling water) (fresh vegetables) 0.997933 0.093961 6
61 (ice cream ice) (fresh fruits) 1.034761 0.080281 6
62 (ice cream ice) (fresh vegetables) 0.990400 0.066600 6
63 (packaged vegetables fruits) (ice cream ice) 1.038094 0.054874 6
64 (ice cream ice) (fresh fruits, fresh vegetables) 1.028194 0.049356 6
65 (cereal) (fresh vegetables) 1.023846 0.048896 6
66 (fresh vegetables) (juice nectars) 1.028017 0.048513 6
67 (milk) (water seltzer sparkling water) 1.011483 0.047670 6
68 (soft drinks) (fresh fruits) 0.849303 0.043761 6
69 (frozen meals) (fresh vegetables) 0.975458 0.041999 6
In [158]:
dfs_subs_selected = dfs_subs_final.drop(dfs_subs_final.index[[0,7,10,11,12,13,15,17,19,20,21,22,23,26,27,29,30,31,32,34,36,39,40,41,42,44,45,48,49,50,51,52,54,55,59,60,64,68]])
dfs_subs_selected
Out[158]:
antecedents consequents lift support index
1 (fresh vegetables) (refrigerated) 1.034967 0.077199 0
2 (ice cream ice) (fresh fruits) 1.022712 0.074406 0
3 (fresh vegetables) (ice cream ice) 1.003870 0.063801 0
4 (cereal) (fresh vegetables) 1.003836 0.053652 0
5 (milk) (water seltzer sparkling water) 0.994827 0.049205 0
6 (fresh vegetables) (juice nectars) 1.026134 0.049205 0
8 (frozen meals) (fresh vegetables) 0.984349 0.045613 0
9 (milk) (soy lactosefree) 0.872122 0.041279 0
14 (milk) (water seltzer sparkling water) 1.039780 0.050425 1
16 (energy granola bars) (fresh vegetables) 0.973196 0.042448 1
18 (candy chocolate) (fresh fruits) 1.016374 0.042215 1
24 (fresh fruits) (ice cream ice) 1.067395 0.055541 2
25 (water seltzer sparkling water) (milk) 0.987431 0.044946 2
28 (candy chocolate) (fresh fruits) 1.035930 0.040804 2
33 (fresh fruits) (ice cream ice) 1.062251 0.057499 3
35 (yogurt) (water seltzer sparkling water) 1.097309 0.050721 3
37 (ice cream ice) (fresh vegetables) 1.038007 0.043163 3
38 (milk) (water seltzer sparkling water) 0.959463 0.041527 3
43 (fresh fruits) (ice cream ice) 1.052562 0.057300 4
46 (fresh fruits) (baking ingredients) 1.098130 0.044807 4
47 (milk) (water seltzer sparkling water) 0.982703 0.044406 4
53 (fresh fruits) (ice cream ice) 1.051474 0.067365 5
56 (water seltzer sparkling water) (milk) 0.951843 0.045901 5
57 (soft drinks) (fresh fruits) 0.821515 0.044320 5
58 (fresh fruits) (candy chocolate) 1.042333 0.042249 5
61 (ice cream ice) (fresh fruits) 1.034761 0.080281 6
62 (ice cream ice) (fresh vegetables) 0.990400 0.066600 6
63 (packaged vegetables fruits) (ice cream ice) 1.038094 0.054874 6
65 (cereal) (fresh vegetables) 1.023846 0.048896 6
66 (fresh vegetables) (juice nectars) 1.028017 0.048513 6
67 (milk) (water seltzer sparkling water) 1.011483 0.047670 6
69 (frozen meals) (fresh vegetables) 0.975458 0.041999 6
In [159]:
mapping_dow = pd.read_csv('Mapping_dow.csv')
In [160]:
df_dow = pd.merge(dfs_subs_selected,mapping_dow, how='left')
df1_dow = pd.merge(df_cmpt_dow,mapping_dow, how='left')
In [161]:
df_dow = df_dow[~(df_dow.antecedents.str.len() > 1)]
df_dow['antecedents'] = [list(x)[0] for x in df_dow['antecedents']]
df_dow['consequents'] = [list(x)[0] for x in df_dow['consequents']]
In [162]:
df1_dow.shape
Out[162]:
(11, 7)
In [163]:
df1_dow = df1_dow[~(df1_dow.antecedents.str.len() > 1)]
df1_dow['antecedents'] = [list(x)[0] for x in df1_dow['antecedents']]
df1_dow['consequents'] = [list(x)[0] for x in df1_dow['consequents']]
In [164]:
df1_dow
Out[164]:
antecedents consequents lift confidence support index Day
1 fresh herbs fresh vegetables 1.951179 0.854479 0.074706 1 Monday
3 fresh herbs fresh vegetables 2.007477 0.832885 0.068153 2 Tuesday
4 fresh herbs fresh vegetables 2.077131 0.829477 0.064238 3 Wednesday
6 fresh herbs fresh vegetables 2.082811 0.826986 0.070433 4 Thursday
8 fresh herbs fresh vegetables 2.023854 0.819616 0.072373 5 Friday
In [165]:
fig = px.parallel_categories(df_dow, dimensions=['Day','antecedents','consequents'],
                color="support", color_continuous_scale=px.colors.sequential.Inferno,
                labels={'Day':'Day','antecedents':'antecedents','consequents':'consequents'})#'antecedents':'antecedents'})
fig.update_layout(legend = dict(bgcolor = '#fff0db'))
fig.show()
In [166]:
df_dow
Out[166]:
antecedents consequents lift support index Day
0 fresh vegetables refrigerated 1.034967 0.077199 0 Sunday
1 ice cream ice fresh fruits 1.022712 0.074406 0 Sunday
2 fresh vegetables ice cream ice 1.003870 0.063801 0 Sunday
3 cereal fresh vegetables 1.003836 0.053652 0 Sunday
4 milk water seltzer sparkling water 0.994827 0.049205 0 Sunday
5 fresh vegetables juice nectars 1.026134 0.049205 0 Sunday
6 frozen meals fresh vegetables 0.984349 0.045613 0 Sunday
7 milk soy lactosefree 0.872122 0.041279 0 Sunday
8 milk water seltzer sparkling water 1.039780 0.050425 1 Monday
9 energy granola bars fresh vegetables 0.973196 0.042448 1 Monday
10 candy chocolate fresh fruits 1.016374 0.042215 1 Monday
11 fresh fruits ice cream ice 1.067395 0.055541 2 Tuesday
12 water seltzer sparkling water milk 0.987431 0.044946 2 Tuesday
13 candy chocolate fresh fruits 1.035930 0.040804 2 Tuesday
14 fresh fruits ice cream ice 1.062251 0.057499 3 Wednesday
15 yogurt water seltzer sparkling water 1.097309 0.050721 3 Wednesday
16 ice cream ice fresh vegetables 1.038007 0.043163 3 Wednesday
17 milk water seltzer sparkling water 0.959463 0.041527 3 Wednesday
18 fresh fruits ice cream ice 1.052562 0.057300 4 Thursday
19 fresh fruits baking ingredients 1.098130 0.044807 4 Thursday
20 milk water seltzer sparkling water 0.982703 0.044406 4 Thursday
21 fresh fruits ice cream ice 1.051474 0.067365 5 Friday
22 water seltzer sparkling water milk 0.951843 0.045901 5 Friday
23 soft drinks fresh fruits 0.821515 0.044320 5 Friday
24 fresh fruits candy chocolate 1.042333 0.042249 5 Friday
25 ice cream ice fresh fruits 1.034761 0.080281 6 Saturday
26 ice cream ice fresh vegetables 0.990400 0.066600 6 Saturday
27 packaged vegetables fruits ice cream ice 1.038094 0.054874 6 Saturday
28 cereal fresh vegetables 1.023846 0.048896 6 Saturday
29 fresh vegetables juice nectars 1.028017 0.048513 6 Saturday
30 milk water seltzer sparkling water 1.011483 0.047670 6 Saturday
31 frozen meals fresh vegetables 0.975458 0.041999 6 Saturday

6.3.2 TOFD per DOW Results

In [167]:
rConf_list_tofd, rLift_list_tofd, cmpt_list_tofd, subs_list_tofd = ACL_per_split(df_tofd_list, ms=0.04, cmptc=0.7, cmptl=1.7)
In [168]:
# Lowest 10 Lift values selected, then ranked by Support
df_subs_tofd = pd.DataFrame()
for i in range(0,len(subs_list_tofd)):
    df_insert = subs_list_tofd[i].iloc[:10,:][["antecedents","consequents","lift",'support']]
    df_insert["index"] = i
    df_subs_tofd = df_subs_tofd.append(df_insert)
# Select even (or odd) number index value to exclude opposite direction duplicates, then rank by support.
dfs_subs_toft_final = df_subs_tofd.iloc[::2].sort_values(by=['index','support'], ascending=[True,False])
In [169]:
df_cmpt_tofd = pd.DataFrame()
for i in range(0,len(cmpt_list_tofd)):
    df_insert = cmpt_list_tofd[i].iloc[:5,:][["antecedents","consequents","lift"]]
    df_insert["index"] = i
    df_cmpt_tofd = df_cmpt_tofd.append(df_insert)
df_cmpt_tofd
Out[169]:
antecedents consequents lift index
403 (fresh fruits, packaged vegetables fruits, fre... (fresh vegetables) 1.731021 0
309 (packaged vegetables fruits, fresh herbs) (fresh vegetables) 1.715497 0
407 (packaged vegetables fruits, fresh herbs) (fresh fruits, fresh vegetables) 1.994416 0
367 (packaged vegetables fruits, canned jarred veg... (fresh fruits, fresh vegetables) 1.811073 0
215 (fresh herbs) (fresh fruits, fresh vegetables) 1.776482 0
339 (packaged vegetables fruits, fresh herbs) (fresh fruits, fresh vegetables) 1.929463 1
409 (packaged vegetables fruits, milk, packaged ch... (fresh fruits, fresh vegetables) 1.804405 1
422 (packaged vegetables fruits, packaged cheese, ... (fresh fruits, fresh vegetables) 1.766705 1
185 (fresh herbs) (fresh fruits, fresh vegetables) 1.757304 1
198 (packaged vegetables fruits, fresh herbs) (fresh vegetables) 1.848952 2
139 (fresh fruits, fresh herbs) (fresh vegetables) 1.833504 2
56 (fresh herbs) (fresh vegetables) 1.766410 2
136 (packaged vegetables fruits, canned meals beans) (fresh vegetables) 1.745270 3
348 (fresh fruits, packaged vegetables fruits, fre... (fresh vegetables) 1.743526 3
271 (packaged vegetables fruits, fresh herbs) (fresh vegetables) 1.732677 3
186 (fresh fruits, fresh herbs) (fresh vegetables) 1.723440 3
310 (fresh fruits, bread, packaged cheese) (fresh vegetables) 1.704365 3
152 (packaged vegetables fruits, fresh herbs) (fresh vegetables) 2.232582 4
102 (fresh fruits, fresh herbs) (fresh vegetables) 2.196718 4
48 (fresh herbs) (fresh vegetables) 2.139521 4
8 (canned jarred vegetables) (fresh vegetables) 1.931874 4
185 (fresh fruits, packaged vegetables fruits, eggs) (fresh vegetables) 1.884278 4
157 (packaged vegetables fruits, fresh herbs) (fresh vegetables) 1.927624 5
106 (fresh fruits, fresh herbs) (fresh vegetables) 1.915548 5
78 (fresh fruits, canned jarred vegetables) (fresh vegetables) 1.838521 5
49 (fresh herbs) (fresh vegetables) 1.826870 5
181 (fresh fruits, packaged vegetables fruits, eggs) (fresh vegetables) 1.742642 5
137 (packaged vegetables fruits, fresh herbs) (fresh vegetables) 2.013626 6
95 (fresh fruits, fresh herbs) (fresh vegetables) 2.010089 6
44 (fresh herbs) (fresh vegetables) 1.932122 6
70 (fresh fruits, canned jarred vegetables) (fresh vegetables) 1.920808 6
161 (fresh fruits, packaged vegetables fruits, eggs) (fresh vegetables) 1.750280 6
249 (fresh fruits, packaged vegetables fruits, fre... (fresh vegetables) 1.932964 7
199 (packaged vegetables fruits, fresh herbs) (fresh vegetables) 1.928609 7
138 (fresh fruits, fresh herbs) (fresh vegetables) 1.905584 7
63 (fresh herbs) (fresh vegetables) 1.857200 7
214 (packaged vegetables fruits, soup broth bouillon) (fresh vegetables) 1.849246 7
74 (fresh fruits, fresh herbs) (fresh vegetables) 2.263005 8
40 (fresh herbs) (fresh vegetables) 2.194952 8
6 (canned jarred vegetables) (fresh vegetables) 1.917700 8
70 (packaged vegetables fruits, eggs) (fresh vegetables) 1.903032 8
134 (fresh fruits, packaged vegetables fruits, pac... (fresh vegetables) 1.868600 8
117 (packaged vegetables fruits, fresh herbs) (fresh vegetables) 2.050336 9
78 (fresh fruits, fresh herbs) (fresh vegetables) 1.987173 9
42 (fresh herbs) (fresh vegetables) 1.900707 9
8 (canned jarred vegetables) (fresh vegetables) 1.769362 9
153 (fresh fruits, packaged vegetables fruits, soy... (fresh vegetables) 1.762150 9
77 (fresh fruits, fresh herbs) (fresh vegetables) 2.138016 10
38 (fresh herbs) (fresh vegetables) 2.065805 10
134 (fresh fruits, packaged vegetables fruits, pac... (fresh vegetables) 1.775627 10
4 (canned jarred vegetables) (fresh vegetables) 1.775095 10
71 (packaged vegetables fruits, eggs) (fresh vegetables) 1.749119 10
198 (fresh fruits, packaged vegetables fruits, fre... (fresh vegetables) 1.984597 11
167 (packaged vegetables fruits, fresh herbs) (fresh vegetables) 1.940292 11
116 (fresh fruits, fresh herbs) (fresh vegetables) 1.860194 11
194 (fresh fruits, packaged vegetables fruits, eggs) (fresh vegetables) 1.756168 11
202 (packaged vegetables fruits, fresh herbs) (fresh fruits, fresh vegetables) 2.410898 11
59 (fresh fruits, fresh herbs) (fresh vegetables) 2.428420 12
34 (fresh herbs) (fresh vegetables) 2.356614 12
5 (canned jarred vegetables) (fresh vegetables) 1.998443 12
105 (fresh fruits, packaged vegetables fruits, pac... (fresh vegetables) 1.972751 12
71 (fresh fruits, fresh herbs) (fresh vegetables) 2.053554 13
40 (fresh herbs) (fresh vegetables) 1.929787 13
136 (fresh fruits, packaged vegetables fruits, soy... (fresh vegetables) 1.793259 13
132 (fresh fruits, packaged vegetables fruits, pac... (fresh vegetables) 1.770674 13
5 (canned jarred vegetables) (fresh vegetables) 1.762894 13
50 (fresh fruits, fresh herbs) (fresh vegetables) 2.174268 14
32 (fresh herbs) (fresh vegetables) 2.070335 14
106 (fresh fruits, yogurt, packaged cheese) (fresh vegetables) 1.891151 14
102 (fresh fruits, packaged vegetables fruits, pac... (fresh vegetables) 1.858181 14
200 (fresh fruits, packaged vegetables fruits, fre... (fresh vegetables) 1.994688 15
169 (packaged vegetables fruits, fresh herbs) (fresh vegetables) 1.949869 15
117 (fresh fruits, fresh herbs) (fresh vegetables) 1.930295 15
86 (fresh fruits, canned jarred vegetables) (fresh vegetables) 1.900341 15
54 (fresh herbs) (fresh vegetables) 1.828607 15
65 (fresh fruits, fresh herbs) (fresh vegetables) 2.329031 16
38 (fresh herbs) (fresh vegetables) 2.238431 16
5 (canned jarred vegetables) (fresh vegetables) 2.005587 16
63 (packaged vegetables fruits, eggs) (fresh vegetables) 1.953102 16
115 (fresh fruits, packaged vegetables fruits, pac... (fresh vegetables) 1.922861 16
69 (fresh fruits, fresh herbs) (fresh vegetables) 2.206896 17
105 (packaged vegetables fruits, fresh herbs) (fresh vegetables) 2.188413 17
38 (fresh herbs) (fresh vegetables) 2.074923 17
5 (canned jarred vegetables) (fresh vegetables) 1.896919 17
126 (fresh fruits, packaged vegetables fruits, pac... (fresh vegetables) 1.843388 17
105 (packaged vegetables fruits, fresh herbs) (fresh vegetables) 2.142851 18
69 (fresh fruits, fresh herbs) (fresh vegetables) 2.037905 18
35 (fresh herbs) (fresh vegetables) 1.983773 18
6 (canned jarred vegetables) (fresh vegetables) 1.785187 18
124 (fresh fruits, packaged vegetables fruits, pac... (fresh vegetables) 1.757997 18
227 (fresh fruits, packaged vegetables fruits, fre... (fresh vegetables) 1.900733 19
129 (fresh fruits, fresh herbs) (fresh vegetables) 1.870218 19
191 (packaged vegetables fruits, fresh herbs) (fresh vegetables) 1.862367 19
57 (fresh herbs) (fresh vegetables) 1.801171 19
231 (packaged vegetables fruits, fresh herbs) (fresh fruits, fresh vegetables) 2.248287 19
89 (fresh fruits, fresh herbs) (fresh vegetables) 2.206483 20
133 (packaged vegetables fruits, fresh herbs) (fresh vegetables) 2.191265 20
47 (fresh herbs) (fresh vegetables) 2.108333 20
135 (frozen produce, packaged vegetables fruits) (fresh vegetables) 1.820833 20
17 (dry pasta) (fresh vegetables) 1.813908 20
123 (packaged vegetables fruits, fresh herbs) (fresh vegetables) 2.163613 21
83 (fresh fruits, fresh herbs) (fresh vegetables) 2.086045 21
44 (fresh herbs) (fresh vegetables) 2.024409 21
9 (canned jarred vegetables) (fresh vegetables) 1.846244 21
142 (fresh fruits, packaged vegetables fruits, bread) (fresh vegetables) 1.806371 21
78 (fresh fruits, fresh herbs) (fresh vegetables) 2.128246 22
42 (fresh herbs) (fresh vegetables) 1.988071 22
131 (fresh fruits, packaged vegetables fruits, milk) (fresh vegetables) 1.816571 22
138 (fresh fruits, packaged vegetables fruits, pac... (fresh vegetables) 1.783271 22
119 (packaged vegetables fruits, milk) (fresh vegetables) 1.738189 22
240 (fresh fruits, packaged vegetables fruits, fre... (fresh vegetables) 1.894543 23
201 (packaged vegetables fruits, fresh herbs) (fresh vegetables) 1.863641 23
139 (fresh fruits, fresh herbs) (fresh vegetables) 1.838895 23
61 (fresh herbs) (fresh vegetables) 1.749123 23
103 (fresh fruits, canned jarred vegetables) (fresh vegetables) 1.719157 23
274 (fresh fruits, packaged vegetables fruits, fre... (fresh vegetables) 1.852250 24
222 (packaged vegetables fruits, fresh herbs) (fresh vegetables) 1.822543 24
161 (fresh fruits, fresh herbs) (fresh vegetables) 1.820905 24
114 (fresh fruits, canned jarred vegetables) (fresh vegetables) 1.748041 24
64 (fresh herbs) (fresh vegetables) 1.745953 24
242 (fresh fruits, packaged vegetables fruits, fre... (fresh vegetables) 1.826093 25
197 (packaged vegetables fruits, fresh herbs) (fresh vegetables) 1.797919 25
137 (fresh fruits, fresh herbs) (fresh vegetables) 1.751969 25
99 (fresh fruits, canned jarred vegetables) (fresh vegetables) 1.720199 25
246 (packaged vegetables fruits, fresh herbs) (fresh fruits, fresh vegetables) 2.142401 25
158 (packaged vegetables fruits, fresh herbs) (fresh vegetables) 1.956118 26
109 (fresh fruits, fresh herbs) (fresh vegetables) 1.937730 26
81 (fresh fruits, canned jarred vegetables) (fresh vegetables) 1.872417 26
53 (fresh herbs) (fresh vegetables) 1.834945 26
169 (packaged vegetables fruits, soup broth bouillon) (fresh vegetables) 1.776813 26
129 (packaged vegetables fruits, canned jarred veg... (fresh vegetables) 1.788134 27
342 (fresh fruits, packaged vegetables fruits, fre... (fresh vegetables) 1.765665 27
283 (packaged vegetables fruits, fresh herbs) (fresh vegetables) 1.747614 27
201 (fresh fruits, fresh herbs) (fresh vegetables) 1.720308 27
156 (condiments, fresh fruits) (fresh vegetables) 1.701404 27
In [170]:
# some days don't have products that meet the criteria for complements parameter, hence the index is missing for those.
In [171]:
mapping = pd.read_csv('Mapping.csv')
In [172]:
df = pd.merge(df_subs_tofd,mapping, how='left')
df1 = pd.merge(df_cmpt_tofd,mapping, how='left')
In [173]:
df = df[~(df.antecedents.str.len() > 1)]
df['antecedents'] = [list(x)[0] for x in df['antecedents']]
df['consequents'] = [list(x)[0] for x in df['consequents']]
In [174]:
df1.shape
Out[174]:
(135, 6)
In [175]:
df1 = df1[~(df1.antecedents.str.len() > 1)]
df1['antecedents'] = [list(x)[0] for x in df1['antecedents']]
df1['consequents'] = [list(x)[0] for x in df1['consequents']]
In [176]:
fig = px.parallel_categories(df, dimensions=['Day', 'Time','antecedents','consequents'],
                color="lift", color_continuous_scale=px.colors.sequential.Inferno,
                labels={'Day':'Day', 'Time':'Time','antecedents':'antecedents','consequents':'consequents'})#'antecedents':'antecedents'})
fig.update_layout(legend = dict(bgcolor = '#fff0db'))
fig.show()

7. Deployment

7.1 Shopping Cart Recommender

In [177]:
def Recommender_System(user_id):
    
    '''
    enter user_id and return a list of 5 recommendations.
    '''
    
    u = high_volume.groupby(['user_id','product_name']).size().sort_values(ascending=False).unstack().fillna(0)
    u_sim = pd.DataFrame(cosine_similarity(u), index=u.index, columns=u.index)

    p = high_volume.groupby(['product_name','user_id']).size().sort_values(ascending=False).unstack().fillna(0)
    
    recommendations = pd.Series(np.dot(p.values,cosine_dists[user_id]), index=p.index)
    return recommendations.sort_values(ascending=False).head()
In [178]:
# get the list of orders that have been reordered before
reorders = order_products[order_products['reordered'] == 1]
orders2 = orders[['order_id', 'user_id']]
reorders = pd.merge(reorders, products[['product_id', 'product_name']], how='left', on=['product_id'])
# merge to get user_id and product_id
user_orders = reorders.merge(orders2, on='order_id')
# filtering out the high volumn products that user reordered more than once
user_orders['high_volume'] = (user_orders['product_id'].value_counts().sort_values(ascending=False)>1)
high_volume = user_orders[user_orders['high_volume'] == True]
# get a matrix of different high volume items that particular user purchased
high_volume_users = high_volume.groupby(['user_id', 'product_name']).size().sort_values(ascending=False).unstack().fillna(0)
# calculate similarity between each user
cosine_dists = pd.DataFrame(cosine_similarity(high_volume_users),index=high_volume_users.index, columns=high_volume_users.index)
cosine_dists.head()
Out[178]:
user_id 5430 30618 39332 41591 45892 52745 57561 62481 67155 81461 ... 100845 126892 135442 143742 172653 178268 181484 189943 189954 190709
user_id
5430 1.000000 0.000000 0.223607 0.000000 0.149071 0.0 0.0 0.000000 0.158114 0.0 ... 0.258199 0.447214 0.000000 0.000000 0.248069 0.000000 0.182574 0.000000 0.000000 0.200000
30618 0.000000 1.000000 0.000000 0.188982 0.166667 0.0 0.0 0.000000 0.000000 0.0 ... 0.000000 0.000000 0.277350 0.000000 0.277350 0.000000 0.000000 0.000000 0.500000 0.000000
39332 0.223607 0.000000 1.000000 0.566947 0.333333 0.0 0.0 0.000000 0.000000 0.0 ... 0.000000 0.500000 0.416025 0.000000 0.138675 0.500000 0.408248 0.353553 0.000000 0.000000
41591 0.000000 0.188982 0.566947 1.000000 0.377964 0.0 0.0 0.000000 0.267261 0.0 ... 0.000000 0.000000 0.524142 0.000000 0.419314 0.377964 0.308607 0.267261 0.251976 0.000000
45892 0.149071 0.166667 0.333333 0.377964 1.000000 0.0 0.0 0.333333 0.000000 0.0 ... 0.000000 0.000000 0.462250 0.149071 0.277350 0.333333 0.408248 0.000000 0.222222 0.149071

5 rows × 25 columns

In [179]:
# recommendation for customer id=30618
Recommender_System(30618)
Out[179]:
product_name
fresh fruits              6.656982
butter                    2.595445
yogurt                    1.668070
fresh vegetables          1.596343
hot dogs bacon sausage    1.547723
dtype: float64
In [ ]:
 
In [ ]:
 
In [ ]: